CHAPTER 1


Introduction to Machine 
Learning and R 





Beginners to machine learning are often confused by the plethora of algorithms
and techniques being taught in subjects like statistical learning, data mining, artificial
intelligence, soft computing, and data science. Naturally, they end up asking how
these subjects differ and which is best for solving real-world problems. There
is substantial overlap among these subjects, and it's hard to draw a clear Venn diagram
explaining the differences. Primarily, the foundation for these subjects is derived from
probability and statistics. Machine learning played a pivotal role in transforming statistics
into a more accessible subject by showing its applications to real-world problems.
However, many statisticians probably won't agree that machine learning gave life to
statistics, which gives rise to never-ending chicken-and-egg discussions.
Fundamentally, without spending much effort weighing the pros and cons of this
discussion, it’s wise to believe that the power of statistics needed a pipeline to flow across
different industries with challenging problems to be solved, and machine learning
simply established that high-speed, frictionless pipeline. The other subjects that
evolved from statistics and machine learning are simply trying to broaden the scope of
these two subjects and put them under a bigger banner.

Except for statistical learning, which is generally offered by mathematics or statistics
departments in the majority of universities across the globe, the rest are taught by
computer science departments. In recent years, this separation has been disappearing, but
the collaboration between the two departments is still not complete. Programmers are
intimidated by the complex theorems and proofs, and statisticians hate talking (read as
coding) to machines all the time. But as more industries become data and product
driven, the need for getting the two departments to speak a common language is strongly
emphasized. Roles in industry have been suitably revamped to create openings like machine
learning engineers, data engineers, and data scientists, in a broad group called the
data science team.


The purpose of this chapter is to take one step back and demystify the terminologies 
as we travel through the history of machine learning and emphasize that putting the ideas 
from statistics and machine learning into practice by broadening the scope is critical. 

At the same time, we elaborate on the importance of learning the fundamentals of 
machine learning with an approach inspired by the contemporary techniques from data 
science. We have simplified the mathematics as far as possible without
compromising the fundamentals and core of the subject. The right balance of
statistics and computer science is always required for understanding machine learning, 
and we have made every effort for our readers to appreciate the elegance of mathematics, 
which at times is perceived by many to be hard and full of convoluted definitions, 
theories, and formulas. 


1.1 Understanding the Evolution 


The first challenge anybody finds when starting to understand how to build intelligent 
machines is how to mimic human behavior in many ways or, to put it even more 
appropriately, how to do things even better and more efficiently than humans. Some 
examples of these things performed by machines are identifying spam e-mails, predicting 
customer churn, classifying documents into respective categories, playing chess, 
competing on Jeopardy, cleaning houses, playing football, and much more. Carefully
looking at these examples will reveal that we humans haven't perfected these tasks to date 
and rely heavily on machines to help us. So, now the question remains, where do you start 
learning to build such intelligent machines? Often, depending on which task you want to 
take up, experts will point you to machine learning, artificial intelligence (AI), or many 
such subjects, that sound different by name but are intrinsically connected. 

In this chapter, we have taken up the task to knit together this evolution and finally 
put forth the point that machine learning, which is the first block in this evolution, is 
where you should fundamentally start to later delve deeper into other subjects. 


1.1.1 Statistical Learning 


The whitepaper, Discovery with Data: Leveraging Statistics with Computer Science to 
Transform Science and Society by American Statistical Association (ASA) [1], published 
in July 2014, pointed out rightly, “Statistics as the science of learning from data, and of 
measuring, controlling, and communicating uncertainty is the most mature of the data 
sciences.” They also added that, over the last two centuries, and particularly the last 30 years
with the ability to do large-scale computing, this discipline has been an essential part 
of the social, natural, bio-medical, and physical sciences, engineering, and business 
analytics, among others. Statistical thinking not only helps make scientific discoveries, 
but it quantifies the reliability, reproducibility, and general uncertainty associated with 
these discoveries. This excerpt from the whitepaper is very precise and powerful in 
describing the importance of statistics in data analysis. 

Tom Mitchell, in his article, “The Discipline of Machine Learning [2],” appropriately 
points out, “Over the past 50 years, the study of machine learning has grown from the 
efforts of a handful of computer engineers exploring whether computers could learn to 
play games, and a field of statistics that largely ignored computational considerations, 
to a broad discipline that has produced fundamental statistical-computational theories of
learning processes.” 

This learning process has found its application in a variety of tasks for commercial 
and profitable systems like computer vision, robotics, speech recognition, and many 
more. At large, it’s when statistics and computational theories are fused together that 
machine learning emerges as a new discipline. 


1.1.2 Machine Learning (ML) 


The Samuel Checkers-Playing Program, which is known to be the first computer program 
that could learn, was developed in 1959 by Arthur Lee Samuel, one of the fathers of 
machine learning. Following Samuel, Ryszard S. Michalski, also deemed a father of
machine learning, came out with a system for recognizing handwritten alphanumeric
characters, working along with Jacek Karpinski in 1962-1970. The subject has since
evolved with many facets and led the way for various applications impacting businesses
and society for the good.

Tom Mitchell defined the fundamental question machine learning seeks to answer 
as, “How can we build computer systems that automatically improve with experience, 
and what are the fundamental laws that govern all learning processes?” He further 
explains, the defining question of computer science is, “How can we build machines 
that solve problems, and which problems are inherently tractable/intractable?” whereas 
statistics focus on answering “What can be inferred from data plus a set of modeling 
assumptions, with what reliability?” 

This set of questions clearly shows the difference between statistics and machine
learning. As mentioned earlier in the chapter, it might not even be necessary to deal with
the chicken-and-egg conundrum, as we clearly see that one simply complements the other
and is paving the path for the future. As we dive deep into the concepts of statistics and
machine learning, you will see the differences clearly emerging or at times completely
disappearing. Another line of thought, in the paper “Statistical Modeling: The Two
Cultures” by Leo Breiman in 2001 [3], argued that statisticians rely too heavily on data
modeling, and that machine learning techniques instead focus on the predictive
accuracy of models.


1.1.3 Artificial Intelligence (AI) 


The AI world was intrigued by games from the very beginning. Whether it be checkers, chess,
Jeopardy, or the recently very popular Go, the AI world strives to build machines that can
play against humans and beat them in these games, and it has received many accolades
for the same. IBM’s Watson beat the two best players of Jeopardy, a quiz game show
wherein participants compete to come up with their responses, phrased in the form
of questions, to general knowledge clues given in the form of answers. Considering the
complexity of analyzing the natural language phrases in these answers, it was considered to
be very hard for machines to compete with humans. A high-level architecture of IBM's
DeepQA used in Watson is shown in Figure 1-1.


[Figure: DeepQA pipeline stages: question analysis, query decomposition, hypothesis generation, primary search, candidate answer generation, soft filtering, hypothesis and evidence scoring, evidence sources, supporting evidence retrieval, deep evidence scoring, synthesis, final merging and ranking, answer and confidence]


Figure 1-1. Architecture of IBM's DeepQA


AI also sits at the core of robotics. The 1971 Turing Award winner, John McCarthy,
a well-known American computer scientist, is believed to have coined this term, and
in his article titled “What Is Artificial Intelligence?” he defined it as “the science and
engineering of making intelligent machines” [4]. So, if you relate back to what we said
about machine learning, we instantly sense a connection between the two, but AI goes 
the extra mile to congregate a number of sciences and professions, including linguistics, 
philosophy, psychology, neuroscience, mathematics, and computer science, as well as 
other specialized fields such as artificial psychology. It should also be pointed out that 
machine learning is often considered to be a subset of AI. 


1.1.4 Data Mining 


Knowledge Discovery and Data Mining (KDD), a premier forum for data mining, states
its goal to be the advancement, education, and adoption of the “science” of knowledge
discovery and data mining. Data mining, like ML and AI, has emerged as an interdisciplinary
subfield of computer science, and for this reason KDD commonly presents data mining
methods as the intersection of AI, ML, statistics, and database systems. Data mining
techniques were integrated into many database systems and business intelligence tools
when the adoption of analytic services was starting to explode in many industries.

The research paper, “WEKA Experiences with a Java open-source project” [5] (WEKA
is one of the widely adopted tools for doing research and projects using data mining),
published in the Journal of Machine Learning Research, described how the classic book
Data Mining: Practical machine learning tools and techniques with Java [6] was originally
named just Practical Machine Learning, and the term data mining was only added for
marketing reasons. Eibe Frank and Mark A. Hall, who wrote this research paper, are two
co-authors of the book, so we have a strong rationale to believe this reason for the name
change. Once again, we see ML fundamentally being at the core of data mining.




1.1.5 Data Science 


It’s not wrong to call data science a big umbrella that brought everything with a potential 
to show insight from data and build intelligent systems inside it. In the book, Data 
Science for Business [7], Foster Provost and Tom Fawcett introduced the notion of viewing 
data and data science capability as a strategic asset, which will help businesses think 
explicitly about the extent to which one should invest in them. In a way, data science has 
emphasized the importance of data more than the algorithms of learning. 

It has established a well-defined process flow that says, first think about doing
descriptive data analysis and then later start to think about modeling. As a result,
businesses have started to adopt this new methodology because they are able to
relate to it. Another incredible change data science has brought is creating
synergies between various departments within a company. Every department has its
own subject matter experts, and data science teams have started to build their expertise
in using data as a common language to communicate. This paradigm shift has witnessed
the emergence of data-driven growth and many data products. Data science has given us
a framework that aims to create a conglomerate of skill sets, tools, and technologies.
Drew Conway, the famous American data scientist who is known for his Venn diagram
definition of data science, shown in Figure 1-2, has very rightly placed machine
learning in the intersection of Hacking Skills and Math & Statistics Knowledge.





[Figure: Venn diagram with circles labeled Hacking Skills, Math & Statistics Knowledge, and Substantive Expertise]


Figure 1-2. Venn diagram definition of data science


We strongly believe the fundamentals of these different fields of study are all derived
from statistics and machine learning, but different flavors, for reasons justifiable in their own
context, were given to them, which helped the subject to get molded into various systems and
areas of research. This book will help trim down the number of different terminologies being
used to describe the same set of algorithms and tools. It will present, in a simple-to-understand
and coherent way, the algorithms in machine learning and their practical use with R.
Wherever it’s appropriate, we will emphasize the need to go outside the scope of this book
and guide our readers to the relevant materials. By doing so, we are re-emphasizing
the need for mastering traditional approaches in machine learning and, at the same time,
staying abreast of the latest developments in tools and technologies in this space.

Our design of topics in this book is strongly influenced by the data science framework,
but instead of wandering through the vast pool of tools and techniques you would find
in the world of data science, we have kept our focus strictly on teaching practical ways of
applying machine learning algorithms with R.

The rest of this chapter is organized to help readers understand the elements 
of probability and statistics and programming skills in R. Both of these will form the 
foundations for understanding and putting machine learning into practical use. The 
chapter ends with discussion of technologies that apply ML to a real-world problem. Also, 
a generic machine learning process flow will be presented showing how to connect the 
dots, starting from a given problem statement to deploying ML models to working with 
real-world systems. 


1.2 Probability and Statistics 


Common sense and gut instincts play a key role for policy makers, leaders, and
entrepreneurs in building nations and large enterprises. The big question is, how do
we convert these immeasurable human decision-making traits into more objective,
measurable quantities so that we can make better decisions? That's where probability and
statistics come in. Much of statistics is focused on analyzing existing data and drawing
suitable conclusions using probability models. Though it's very common to use
probabilities in statistical modeling, we feel it’s important to identify the different
questions probability and statistics help us answer. An example from the book Learning
Statistics with R: A Tutorial for Psychology Students and Other Beginners by Daniel
Navarro [8], University of Adelaide, helps us understand it much better. Consider these
two pairs of questions:


1. What are the chances of a fair coin coming up heads 10 times
in a row?


2. If my friend flips a coin 10 times and gets 10 heads, is she
playing a trick on me?


and


1. How likely is it that five cards drawn from a perfectly shuffled
deck will all be hearts?


2. If five cards off the top of the deck are all hearts, how likely is it
that the deck was shuffled?


In the case of a coin toss, the first question can be answered if we know the coin is fair:
there's a 50% chance that any individual coin flip will come up heads, or in probability
notation, P(heads) = 0.5. So, our probability P(heads 10 times in a row)
= 0.0009765625 (since all 10 coin tosses are independent of each other, we can simply
compute $(0.5)^{10}$ to arrive at this value). The probability value 0.0009765625 quantifies the
chances of a fair coin coming up heads 10 times in a row.
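
The arithmetic above can be checked in R with a minimal sketch. The variable names are illustrative only, and the binomial call is simply a second way of arriving at the same number.

```r
# Probability of heads on a single toss of a fair coin
p_heads <- 0.5

# Ten independent tosses: multiply the individual probabilities
p_ten_heads <- p_heads^10
print(p_ten_heads)

# Same quantity from the binomial distribution: 10 heads out of 10 tosses
p_binom <- dbinom(10, size = 10, prob = p_heads)
print(p_binom)
```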
On the other side, such a small probability would mean the occurrence of the event
(heads 10 times in a row) is very rare, which helps to infer that my friend is playing
some trick on me when she got all heads. Think about this: does tossing a coin 10 times
give you strong evidence for doubting your friend? Maybe not; you may ask her to repeat
the process several times. The more data we generate, the better the inference will be. The
second pair of questions has the same thought process but is applied to a different problem.

So, fundamentally, probability could be used as a tool in statistics to help us answer 
many such real-world questions using a model. We will explore some basics of both these 
worlds, and it will become evident that both converge at a point where it’s hard to observe 
many differences between the two. 


1.2.1 Counting and Probability Definition 


If we perform a random experiment like tossing three coins, there could be a number of
possible outcomes. Figure 1-3 shows a basic illustration of this experiment: with three
coins, a total of eight possible outcomes (HHH, HHT, HTH, HTT, THH, THT, TTH, and
TTT) are present. This set is called the sample space.

Figure 1-3. Sample space of three-coin tossing experiment


It’s easy to count the total number of possible outcomes in such a simple
example with three coins, but as the size and complexity of the problem increase, manual
counting is not an option. A more formal approach is to use combinations and
permutations. If the order is of significance, we call it a permutation; otherwise, generally
the term combination is used. For instance, if we say it doesn't matter which coin gets
heads or tails out of the three coins and we are only interested in the number of heads, which is
like saying there is no significance to the order, then our set of possible
combinations will be {HHH, HHT, HTT, TTT}. This means HHT and HTH are the same,
since there are two heads in both outcomes. A more formal way to obtain the number of
possible outcomes is shown in Table 1-1. It’s easy to see that for the values n = 2 (heads and
tails) and k = 3 (three coins), we get eight possible permutations and four combinations.


Table 1-1. Permutation and Combinations


                       Permutation              Combination
With Replacement       $n^k$                    $\binom{n+k-1}{k}$
Without Replacement    $\frac{n!}{(n-k)!}$      $\binom{n}{k}$
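
The standard counting formulas for permutations and combinations, with and without replacement, can be tried out with a short R sketch. The helper function names are hypothetical, and the without-replacement example (choosing 3 of 5 items) is an assumed illustration, since it requires k to be no larger than n.

```r
# Hypothetical helpers: n = number of distinct items, k = number of draws
perm_with_replacement    <- function(n, k) n^k
comb_with_replacement    <- function(n, k) choose(n + k - 1, k)
perm_without_replacement <- function(n, k) factorial(n) / factorial(n - k)
comb_without_replacement <- function(n, k) choose(n, k)

# Three coins with two faces each (n = 2, k = 3)
coin_permutations <- perm_with_replacement(2, 3)  # ordered outcomes
coin_combinations <- comb_with_replacement(2, 3)  # outcomes ignoring order
print(c(coin_permutations, coin_combinations))

# Assumed example without replacement: choosing 3 of 5 items
pick_permutations <- perm_without_replacement(5, 3)
pick_combinations <- comb_without_replacement(5, 3)
print(c(pick_permutations, pick_combinations))
```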


Once the sample space is known, it’s easy to define any event for which we would
like to calculate the probability. Suppose we are interested in the event E = tossing at least
two heads, which is favoured by the four outcomes HHH, HHT, HTH, and THH:


$P(\text{Two heads}) = \frac{\text{number of outcomes favourable to } E}{\text{total number of outcomes}} = \frac{4}{8} = 0.5$


This way of calculating the probability using the counts or frequency of occurrence 
is also known as the frequentist probability. There is another class called the Bayesian
probability or conditional probability, which we will explore later in the chapter. 
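
As a rough sketch, the counting approach can be reproduced in R by enumerating the sample space. It assumes the event is read as "at least two heads", which is the reading that gives 4 favourable outcomes out of 8; the variable names are illustrative.

```r
# All ordered outcomes of three coin tosses
sample_space <- expand.grid(coin1 = c("H", "T"),
                            coin2 = c("H", "T"),
                            coin3 = c("H", "T"),
                            stringsAsFactors = FALSE)
n_outcomes <- nrow(sample_space)

# Number of heads in each outcome
n_heads <- rowSums(sample_space == "H")

# Event E (assumed reading): at least two heads
n_favourable <- sum(n_heads >= 2)
p_E <- n_favourable / n_outcomes
print(p_E)
```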


1.2.2 Events and Relationships 


In the previous section, we saw an example of an event. Let’s go a step further and set up a
formal notion of various events and their relationships with each other.


1.2.2.1 Independent Events 


A and B are independent if the occurrence of A gives no additional information about
whether B occurred. Imagine that Facebook enhances its Nearby Friends feature
and tells you the probability of your friend visiting, on weekends, the same cineplex
that you frequent for movies. In the absence of such a feature in Facebook, the
information that you are a very frequent visitor to this cineplex doesn't really increase or
decrease the probability of you meeting your friend at the cineplex. This is because the
events A, you visiting the cineplex for a movie, and B, your friend visiting the cineplex for
a movie, are independent.

On the other hand, if such a feature exists, we can't deny you would try your best to 
increase or decrease your probability of meeting your friend depending upon if he or she 
is close to you or not. And this is only possible because the two events are now linked by a 
feature in Facebook. 
In the commonly used set theory notation, A and B (both with non-zero
probability) are independent iff (read as if and only if) one of the following equivalent
statements holds:


1. The probability of events A and B occurring at the same time is equal
to the product of the probability of event A and the probability of event B:
P(A ∩ B) = P(A)P(B)
where ∩ represents the intersection of the two events and P(A|B) denotes the probability of A given B.


2. Probability of event A given B has already occurred, is equal to 
probability of A 


P(A|B)=P(A) 


3. Similarly, probability of event B given A has already occurred, 
is equal to probability of B 


P(B|A)=P(B) 


For the event A = tossing at least two heads and the event B = tossing a head on the first coin,
P(A ∩ B) = 3/8 = 0.375, whereas P(A)P(B) = 4/8 * 4/8 = 0.25, which is not equal to
P(A ∩ B), so A and B are not independent. Similarly, the other two conditions can also be validated.
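
The independence check for the three-coin example can be sketched in R as follows. The sketch assumes A means "at least two heads" (the reading that gives P(A) = 4/8) and uses illustrative variable names.

```r
# Enumerate the eight equally likely outcomes of three coin tosses
outcomes <- expand.grid(c1 = c("H", "T"),
                        c2 = c("H", "T"),
                        c3 = c("H", "T"),
                        stringsAsFactors = FALSE)
heads <- rowSums(outcomes == "H")

A <- heads >= 2          # assumed reading of "tossing two heads"
B <- outcomes$c1 == "H"  # head on the first coin

p_A  <- mean(A)
p_B  <- mean(B)
p_AB <- mean(A & B)      # P(A and B)

# Condition 1: compare P(A and B) with P(A) * P(B)
print(c(joint = p_AB, product = p_A * p_B))

# Condition 2: compare P(A | B) with P(A)
p_A_given_B <- p_AB / p_B
print(c(conditional = p_A_given_B, marginal = p_A))
```

Both comparisons disagree, which is the same conclusion reached by hand: the two events are not independent.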


1.2.2.2 Conditional Independence 


In the Facebook Nearby Friends example, we were able to ascertain that the probability of
you and your friend both visiting the cineplex at the same time has something to do with
your location and intentions. Though intentions are very hard to quantify, that’s not the case
with location. So, if we define the event C to be being in a location near the cineplex, then
it's not difficult to calculate the probability. But even when you are both nearby, it’s not
necessary that you and your friend would visit the cineplex. More formally, this is where
we define conditional independence: A and B are independent given C if P(A ∩ B|C) = P(A|C)P(B|C).


Note here that independence does not imply conditional independence, and
conditional independence does not imply independence. In a way, it says that once C
is known, knowing A gives no additional information about B.


1.2.2.3 Bayes Theorem 


On the contrary, if A and B are not independent, and information about A reveals
some detail about B or vice versa, we would be interested in calculating P(A|B), read
as the probability of A given B. This has a profound application in modelling many real-world
problems. The widely used form of such conditional probability is called Bayes Theorem
(or Bayes Rule). Formally, for events A and B, the Bayes Theorem is represented as:


$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$


where P(B) ≠ 0. P(A) is called the prior probability and P(A|B) is called the posterior
probability, which is the measure we get after the additional information B is known. Let's
look at Table 1-2, a two-way contingency table for our Facebook Nearby example, to
explain this better.


Table 1-2. Facebook Nearby Example—Two-Way Contingency Table 


Visit Cineplex 
Didn't Visit Cineplex 
Total 








Suppose we would like to know P(Visit Cineplex | Nearby), in other words, the
probability of your friend visiting the cineplex given that he or she is nearby (within one mile of)
the cineplex. A word of caution: this is the probability of your friend visiting the
cineplex, not the probability of you meeting the friend. The latter would be a little more
complex to model, so we skip it here to keep our focus on Bayes Theorem.
Now, assuming we know the historical data (let’s say, the previous month) about your
friend as shown in Table 1-2, we know:


$P(\text{Visit Cineplex} \mid \text{Nearby}) = \frac{10}{12} = 0.83$


This means that, in the previous month, there were ten instances when your friend was within
one mile (nearby) of the cineplex and visited it. Also, there were two instances when he or she
was nearby but didn’t visit the cineplex. Alternatively, we could have calculated the probability as:


$P(\text{Visit Cineplex} \mid \text{Nearby}) = \frac{P(\text{Visit Cineplex} \cap \text{Nearby})}{P(\text{Nearby})}$
This example is based on the two-way contingency table and provides a good
intuition around conditional probability. We will take a deep dive into the machine learning
algorithm called Naive Bayes, which is based on Bayes Theorem, as applied to a real-world
problem later in Chapter 6.
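
The two ways of computing the conditional probability can be compared with a small R sketch. Only the two counts quoted in the text (ten nearby visits and two nearby non-visits) come from the example; the total number of observations is a made-up value, labeled as an assumption, and it cancels out of the ratio.

```r
# Counts quoted in the text for the previous month
visited_nearby     <- 10
not_visited_nearby <- 2
nearby_total       <- visited_nearby + not_visited_nearby

# Direct calculation from the "Nearby" counts
p_direct <- visited_nearby / nearby_total
print(round(p_direct, 2))

# ASSUMPTION: total number of observations; any value >= 12 gives the same ratio
total_observations <- 30
p_visit_and_nearby <- visited_nearby / total_observations
p_nearby           <- nearby_total / total_observations

# Conditional probability as joint probability divided by marginal probability
p_ratio <- p_visit_and_nearby / p_nearby
print(round(p_ratio, 2))
```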


1.2.3 Randomness, Probability, and Distributions 


David S. Moore et al.'s book, Introduction to the Practice of Statistics [9], is an easy-to-comprehend
book with simple mathematics but conceptually rich ideas from statistics.
It very aptly points out, “‘Random’ in statistics is not a synonym for ‘haphazard’ but a
description of a kind of order that emerges in the long run.” The authors further explain that we often
deal with unpredictable events in our daily life that we generally term random,
like the example of Facebook's Nearby Friends, but we rarely see enough repetition of
the same random phenomenon to observe the long-term regularity that probability
describes.

In this excerpt from the book, they capture the essence of randomness, probability, 
and distributions very concisely. 


We call a phenomenon random if individual outcomes are uncertain but 
there is nonetheless a regular distribution of outcomes in a large number of 
repetitions. The probability of any outcome of a random phenomenon is the 
proportion of times the outcome would occur in a very long series of repetitions.


This leads us to define a random variable that stores such random phenomenon 
numerically. In any experiment involving random events, a random variable, say X, based 
on the outcomes of the events will be assigned a numerical value. And the probability 
distribution of X helps in finding the probability for a value being assigned to X. 

For example, if we define X = {number of heads in three coin tosses}, then X can take
the values 0, 1, 2, and 3. Here we call X a discrete random variable. However, if we define X =
{all values between 0 and 2}, there can be infinitely many possible values, so X is called a
continuous random variable.


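# Draw two plots side by side
par(mfrow = c(1, 2))

# Discrete random variable: number of heads in three coin tosses
X_Values <- c(0, 1, 2, 3)
X_Props <- c(1/8, 3/8, 3/8, 1/8)
barplot(X_Props, names.arg = X_Values, ylim = c(0, 1),
        xlab = "Discrete RV X Values", ylab = "Probabilities")

# Continuous random variable: normal density with mean 1 and sd 0.5 on [0, 2]
x <- seq(0, 2, length = 1000)
y <- dnorm(x, mean = 1, sd = 0.5)
plot(x, y, type = "l", lwd = 1, ylim = c(0, 1),
     xlab = "Continuous RV X Values", ylab = "Probabilities")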


The above code plots the distribution of X; a typical probability distribution
function looks like Figure 1-4. The second plot, showing a continuous distribution,
is a normal distribution with mean = 1 and standard deviation = 0.5. It’s also called the
probability density function. Don't worry if you are not familiar with these statistical
terms; we will explore these in much detail later in the book. For now, it is enough to
understand the random variable and what we mean by its distribution.

Figure 1-4. Probability distribution with discrete and continuous random variable
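
As a hedged cross-check, the probabilities 1/8, 3/8, 3/8, 1/8 used for the discrete variable can be derived from the binomial distribution, and the area under the normal density between 0 and 2 can be computed with the cumulative distribution function. The variable names below are illustrative.

```r
# Discrete case: number of heads in three tosses of a fair coin
x_values <- 0:3
x_probs  <- dbinom(x_values, size = 3, prob = 0.5)
print(x_probs)

# The probabilities of a discrete distribution sum to 1
total_prob <- sum(x_probs)
print(total_prob)

# Continuous case: probability that a Normal(mean = 1, sd = 0.5) value
# falls between 0 and 2, i.e. the area under the density curve on [0, 2]
area_0_to_2 <- pnorm(2, mean = 1, sd = 0.5) - pnorm(0, mean = 1, sd = 0.5)
print(area_0_to_2)
```

For a continuous variable the height of the curve is a density rather than a probability; probabilities come from areas under it, which is why the second calculation uses `pnorm` rather than `dnorm`.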


1.2.4 Confidence Interval and Hypothesis Testing 


Suppose you were running a socioeconomic survey for your state among a chosen
sample from the entire population (assuming it’s chosen totally at random). As the data
starts to pour in, you feel excited and, at the same time, a little confused about how you
should analyze the data. There could be many insights that can come from the data, and it’s
possible that not every insight is completely valid, as the survey is only based on a
small randomly chosen sample.

The Law of Large Numbers (a more detailed discussion on this topic is in Chapter 3) in
statistics tells us that the sample mean approaches the population mean as the sample
size increases. In other words, it’s not required that you survey each and
every individual in your state; rather, you choose a sample large enough to be a close
representative of the entire population. Even though measuring uncertainty gives us
the power to make better decisions, in order to make our insights statistically significant, we
need to create a hypothesis and perform certain tests.
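
The behaviour described by the Law of Large Numbers can be illustrated with a small simulation. This is only a sketch: the population mean, standard deviation, seed, and sample sizes are all made-up values chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# ASSUMPTION: a hypothetical normal population with these parameters
population_mean <- 50
population_sd   <- 10

# Draw one large sample and track the mean of the first n observations
draws <- rnorm(100000, mean = population_mean, sd = population_sd)
running_mean <- cumsum(draws) / seq_along(draws)

# Sample mean at increasing sample sizes
sample_sizes <- c(10, 100, 1000, 10000, 100000)
print(data.frame(n = sample_sizes, sample_mean = running_mean[sample_sizes]))
```

The printed means typically wander for small n and settle close to the population mean as n grows.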
1.2.4.1 Confidence Interval


Let’s start by understanding the confidence interval. Suppose that a 10-yearly census
survey questionnaire contains information on income levels. And say, in the year
2005, we find that for samples of size 1000, repeatedly chosen from the population,
the sample mean $\bar{x}$ follows the normal distribution with population mean $\mu$ and
standard error $\sigma/\sqrt{n}$. If we know the standard deviation, $\sigma$, to be 1500 dollars, then

$\sigma_{\bar{x}} = \frac{1500}{\sqrt{1000}} = 47.4$


A confidence interval generally takes the form


estimate ± margin of error


A 95% confidence interval (CI) is the mean plus or minus twice the standard error; the
amount added and subtracted is the margin of error. In our example, suppose $\bar{x} = 990$ dollars
and the standard error as computed is 47.4 dollars; then we would have the confidence interval
(895.2, 1084.8), i.e., $990 \pm 2 \times 47.4$. If we repeatedly choose many samples, each would
have a different confidence interval, but statistics tells us that 95% of the time the CI will contain the true
population mean $\mu$. There are other, more stringent CIs like 99.7%, but 95% is the gold standard
for all practical purposes. Figure 1-5 shows 25 samples and the CIs. The normal
distribution of the population helps to visualize the number of CIs where the true mean $\mu$
wasn't contained in the CI; in this figure, there is only one such CI.
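
The interval from the income example can be reproduced with a few lines of R. This is a minimal sketch with illustrative variable names; the second interval, using the exact normal quantile instead of the multiplier 2, is added only for comparison.

```r
# Values from the income example
sigma <- 1500   # known population standard deviation
n     <- 1000   # sample size
x_bar <- 990    # observed sample mean

# Standard error of the sample mean
se <- sigma / sqrt(n)
print(round(se, 1))

# Approximate 95% CI using the multiplier 2, as in the text
ci_approx <- x_bar + c(-1, 1) * 2 * se
print(ci_approx)

# 95% CI using the exact normal quantile (about 1.96) for comparison
z_star   <- qnorm(0.975)
ci_exact <- x_bar + c(-1, 1) * z_star * se
print(ci_exact)
```

The text's interval (895.2, 1084.8) uses the standard error rounded to 47.4, so the unrounded result differs slightly in the last digit.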
[Figure: density curve of $\bar{x}$ with the confidence intervals of 25 samples]


Figure 1-5. Confidence interval


1.2.4.2 Hypothesis Testing 


Hypothesis testing is sometimes also known as a test of significance. Although the CI is a strong
representative of the population estimate, we need a more robust and formal procedure for
testing and comparing an assumption about the population parameters of the observed data.
The application of hypothesis testing is widespread, from assessing the reliability
of a sample used in a survey for an opinion poll to finding out the efficacy of a new drug
over an existing drug for curing a disease. In general, hypothesis tests are tools for checking
the validity of a statement about certain statistics relating to an experiment design. If you
recall, the high-level architecture of IBM's DeepQA has an important step called hypothesis
generation in coming up with the most relevant answer for a given question.

Hypothesis testing consists of two statements that are framed on the population
parameter, one of which we want to reject. As we saw while discussing the CI, the sampling
distribution of the sample mean $\bar{x}$ follows a normal distribution $N(\mu, \sigma/\sqrt{n})$. One of the
most important concepts is the Central Limit Theorem (a more detailed discussion on this
topic is in Chapter 3), which tells us that for large samples, the sampling distribution is
approximately normal. Since the normal distribution is one of the most explored
distributions, with all of its properties well known, this approximation is vital for every
hypothesis test we would like to perform.

Before we perform the hypothesis test, we need to choose a confidence level of
90%, 95%, or 99%, depending on the design of the study or experiment. For doing this, we
need a number z*, also referred to as the critical value, such that the normal distribution has a
probability of 0.90, 0.95, or 0.99 within ±z* standard deviations of its mean. Figure
1-6 shows the value of z* for different confidence levels. Note that in our example
in Section 1.2.4.1, we approximated z* = 1.960 for the 95% confidence interval by 2.





Figure 1-6. The z* score and confidence level 
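
The critical values can also be computed directly instead of being read from a table. The sketch below assumes a two-sided interval and uses the base R function qnorm, the quantile function of the normal distribution; the variable names are ours.

```r
# Two-sided confidence levels discussed in the text
conf_level <- c(0.90, 0.95, 0.99)

# z* leaves (1 - level) / 2 probability in each tail of N(0, 1)
z_star <- qnorm(1 - (1 - conf_level) / 2)

round(z_star, 3)
```

Rounded to three decimals, this gives 1.645, 1.960, and 2.576 for the 90%, 95%, and 99% levels.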


In general, we could choose any value of z* to pick the appropriate confidence level.
With this explanation, let's take our income example from the census data for the year
2015. We need to find out how the income has changed over the last 10 years, i.e., from
2005 to 2015. In the year 2015, we find the estimate of our mean value for income to be
$2300. Both values, $900 (in the year 2005) and $2300, are estimates of the true population
mean rather than the actual mean, because each was calculated from a representative sample
and not from the entire population. The question to ask is therefore: do these observed sample
means provide enough evidence to conclude that income has increased? We might be interested
in calculating some probability to answer this question. Let's see how we can formulate this in
a hypothesis testing framework. A hypothesis test starts with designing two statements like so:


H₀: There is no difference in the true mean income


Hₐ: The true mean incomes are not the same


Abstracting the details at this point, the consequence of the two statements would
simply lead toward accepting H₀ or rejecting it. In general, the null hypothesis H₀ is always
a statement of “no difference” and the alternative hypothesis Hₐ challenges this null. A more
numerically concise way of writing these two statements would be:


H₀: Sample mean x̄ = 0


Hₐ: Sample mean x̄ ≠ 0


In case we reject H₀, we have a choice to make: whether we want to test x̄ > 0,
x̄ < 0, or simply x̄ ≠ 0 without bothering much about direction, which is called a two-sided
test. If you are clear about the direction, a one-sided test is preferred.

Now, in order to perform the significance test, we need to understand the
standardized test statistic z, which is defined as follows:


\[ z = \frac{\text{estimate} - \text{hypothesized value}}{\text{standard deviation of the estimate}} \]


Alternatively:


\[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]

Substituting the value 1400 for the estimate of the difference in income between the
years 2005 and 2015, 0 for the hypothesized value, and 1500 for the standard deviation of the
estimate (this SD is computed with the mean of all the samples drawn from the population), we obtain


\[ z = \frac{1400 - 0}{1500} = 0.93 \]


The difference in income between 2005 and 2015 based on our sample is $1400,
which corresponds to 0.93 standard deviations away from zero (z = 0.93). Because we
are using a two-sided test for this problem, the evidence against the null hypothesis H₀ is
measured by the probability that we observe a value of Z as extreme or more extreme
than 0.93. More formally, this probability is


P(Z < −0.93 or Z > 0.93)

where Z has the standard normal distribution N(0, 1). This probability is called the p-value.
We will use this value quite often in regression models.

From the standard z-score table of standard normal probabilities, we find:

P(Z > 0.93) = 1 − 0.8238 = 0.1762

Also, the probability of being as extreme in the negative direction is the same:

P(Z < −0.93) = 0.1762

Then, the p-value becomes:

P = 2 P(Z > 0.93) = 2 × 0.1762 = 0.3524

Since this probability is large, we have no other choice but to stick with our null
hypothesis. In other words, we don't have enough evidence to reject the null hypothesis.
It could also be stated as: there is a 35% chance of observing a difference at least as extreme as the
$1400 in our sample if the true population difference is zero. A note here, though: there
could be numerous other ways to state our result, all of which mean the same thing.
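
The table lookup above can be checked in R with pnorm, the cumulative distribution function of the normal distribution. This is a hedged sketch of the same calculation, not a replacement for the formal tests used later in the book; the variable names are our own.

```r
diff_estimate <- 1400   # estimated change in mean income, 2005 to 2015
null_value    <- 0      # hypothesized difference under the null hypothesis
sd_estimate   <- 1500   # standard deviation of the estimate

# Standardized test statistic
z <- (diff_estimate - null_value) / sd_estimate

# Two-sided p-value: probability in both tails beyond |z|
p_value <- 2 * pnorm(-abs(z))

# The same p-value with z rounded to 0.93, as in the z-score table
p_value_table <- 2 * pnorm(-0.93)

round(c(z = z, p_value = p_value, p_value_table = p_value_table), 4)
```

With the unrounded z the p-value is about 0.35; rounding z to 0.93 first reproduces the 0.3524 obtained from the table. Either way it is far above 0.05.
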

Finally, in many practical situations, it's not enough to say that the probability is
large or small; instead, it's compared to a significance or confidence level. So, if we are
given a 95% confidence interval (in other words, the interval that includes the true value
of μ with 0.95 probability), values of μ that are not included in this interval would be
incompatible with the data. Now, using the threshold α = 0.05 (95% confidence), we
observe that the p-value is greater than 0.05 (or 5%), which means we still do not have enough
evidence to reject H₀. Hence, we cannot conclude that the mean income differs
between the years 2005 and 2015.

There are many other ways to perform hypothesis testing, which we leave for the 
interested readers to refer to detailed text on the subject. Our major focus in the coming 
chapters is to do hypothesis testing using R for various applications in sampling and 
regression. 

We have now introduced the fields of probability and statistics, both of which form the
foundation of data exploration and of our broader goal of understanding predictive
modeling using machine learning.


1.3 Getting Started with R 


R is GNU S, a freely available language and environment for statistical computing and
graphics that provides a wide variety of statistical and graphical techniques: linear and
nonlinear modeling, statistical tests, time series analysis, classification, clustering, and
much more.

Although covering the complete topics of R is beyond the scope of this book, we will 
keep our focus intact by looking at the end goal of this book. The getting started material 
here is just to provide the familiarity to readers who don't have any previous exposure 
to programming or scripting languages. We strongly advise that the readers follow R's 
official web site for instructions on installing and some standard textbook for more 
technical discussion on topics. 


1.3.1 Basic Building Blocks 


This section provides a quick overview of the building blocks of R, which
make R one of the most sought-after programming languages among statisticians, analysts, and
scientists. R is easy to learn and an excellent tool for developing prototype models
very quickly.


1.3.1.1 Calculations 


As you would expect, R provides all the arithmetic operations you would find in a scientific
calculator and much more. All kinds of comparisons like >, >=, <, and <=, and functions
such as acos, asin, atan, ceiling, floor, min, max, cumsum, mean, and median are readily
available.
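
A few of these operations are sketched below as they might be typed at the R console. The vector values is made-up illustration data.

```r
# Basic arithmetic
7 + 3 * 2     # multiplication binds tighter than addition
2^10          # exponentiation
17 %/% 5      # integer division
17 %% 5       # remainder

# Comparisons return logical values
5 >= 3

# Some of the built-in functions mentioned above
values <- c(4, 8, 15, 16, 23, 42)   # illustrative data
ceiling(2.3)
floor(2.7)
cumsum(values)   # running total
mean(values)
median(values)
```
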




1.3.1.2 Statistics with R 


R is very friendly to academicians and people with little
programming background. The ease of computing statistical properties of data has
also given it widespread popularity among data analysts and statisticians. Functions
are provided for computing quantiles and ranks, sorting data, and matrix manipulation like
crossprod, eigen, and svd. There are also some really easy-to-use functions for building
linear models quite quickly. A detailed discussion on such models will follow in later
chapters.
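
The functions named above can be tried on a small example. This is only an illustrative sketch: the vector x and matrix m are made up, and the linear model uses mtcars, a dataset that ships with R, purely as a convenient stand-in.

```r
x <- c(12, 7, 3, 15, 9)   # illustrative data

quantile(x)   # 0%, 25%, 50%, 75%, and 100% quantiles
rank(x)       # position of each value in sorted order
sort(x)       # values in increasing order

m <- matrix(c(2, 0, 0, 3), nrow = 2)   # a 2 x 2 diagonal matrix
crossprod(m)       # same as t(m) %*% m
eigen(m)$values    # eigenvalues, largest first
svd(m)$d           # singular values

# A linear model in one line: fuel consumption as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```
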


1.3.1.3 Packages 


The strength of R lies with its community of contributors from various domains.
Developers bundle their work into a single unit called a package. A simple package
can contain a few functions for implementing an algorithm, or it can be as big as the base
package itself, which comes with the R installers. We will use many packages throughout
the book as we cover new topics.
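
As a rough sketch of the usual workflow, a package is installed once and then attached in each session that needs it. The package name data.table below is only an example, and the install line is commented out because it needs network access.

```r
# One-time download from CRAN (example package name, needs network access)
# install.packages("data.table")

# Attach a package for the current session; stats ships with every R installation
library(stats)

# List a few of the functions the attached package provides
head(ls("package:stats"))

# Call a function through its package prefix without attaching the package
utils::head(letters, 3)
```
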


1.3.2 Data Structures in R 


Fundamentally, there are only five types of data structures in R, and they are the ones most often
used. Almost all other data structures are built upon these five. Hadley Wickham, in his
book Advanced R [10], provided an easy-to-comprehend classification of these five data
structures, as shown in Table 1-3.


Table 1-3. Data Structures in R


Dimension    Homogeneous      Heterogeneous
1d           Atomic vector    List
2d           Matrix           Data frame
nd           Array            N/A


Some other commonly used data structures derived from these five are
listed here:


• Factors: Derived from a vector
• Data tables: Derived from a data frame


The homogeneous type allows only a single data type to be stored in a vector,
matrix, or array, whereas the heterogeneous type allows mixed types as well.


1.3.2.1 Vectors


Vectors are the simplest form of data structure in R and yet very useful. Each vector stores
elements of the same type. A vector can be thought of as a one-dimensional array, similar to
those found in programming languages like C/C++.

car_name <- c("Honda", "BMW", "Ferrari")

car_color <- c("Black", "Blue", "Red")

car_cc <- c(2000, 3400, 4000)


1.3.2.2 List 


Lists in R are internally collections of generic vectors. For instance, a list of automobiles
with name, color, and cc could be defined as a list named cars, with a collection of
vectors named name, color, and cc inside it.

cars <- list(name = c("Honda", "BMW", "Ferrari"),
             color = c("Black", "Blue", "Red"),
             cc = c(2000, 3400, 4000))
cars
$name
[1] "Honda"   "BMW"     "Ferrari"


$color
[1] "Black" "Blue"  "Red"


$cc
[1] 2000 3400 4000


1.3.2.3 Matrix 


Matrixes are two-dimensional data structures with rows and
columns. For all practical purposes, a matrix helps store data in a format where
every row represents an observation and the columns hold the information
that defines the observation (row).


mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE,
               dimnames = list(c("row1", "row2"),
                               c("C.1", "C.2", "C.3")))
mdat
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13


1.3.2.4 Data Frame


Data frames extend matrixes with the added capability of holding heterogeneous types of
data. In a data frame, you can store character, numeric, and factor variables in different
columns of the same data frame. In almost every data analysis task with rows and
columns of data, a data frame is a natural choice for storing the data. The following
example shows how numeric and factor columns are stored within the same data frame.


L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
# stringsAsFactors = TRUE converts the character column to a factor
# (this was the default behavior before R 4.0.0)
df <- data.frame(x = 1, y = 1:10, fac = fac, stringsAsFactors = TRUE)


class(df$x)
[1] "numeric"

class(df$y)
[1] "integer"

class(df$fac)
[1] "factor"


1.3.3 Subsetting 


R has some of the most advanced, powerful, and fast subsetting operators of
any programming language. They are powerful to the extent that, except for a few cases
which we will discuss in the next section, no looping construct like for or while is
required, even though R provides both if needed. Though very powerful, subsetting
syntax can sometimes turn into a nightmare, and gross errors can pop up
if careful attention is not paid to placing the required number of parentheses, brackets,
and commas. The operators [, [[, and $ are used for subsetting, depending on which data
structure is holding the data. It's also possible to combine subsetting with assignment to
perform some really complicated operations with very few lines of code.
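
As a small illustration of combining subsetting with assignment, the sketch below caps every engine size above 3000 cc without writing a loop. It reuses the car_cc values from the earlier vector example; the cap of 3000 is an arbitrary choice for demonstration.

```r
car_cc <- c(2000, 3400, 4000)   # engine sizes from the earlier vector example

# The logical condition inside [ ] selects the matching elements;
# assigning to that selection changes only those elements
car_cc[car_cc > 3000] <- 3000

car_cc
```

After the assignment, car_cc holds 2000, 3000, and 3000.
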


1.3.3.1 Vectors 


For vectors, subsetting is done by referring to the index of the
elements stored in the vector. For example, car_name[c(1,2)] will return the elements stored
at index 1 and 2, and car_name[-2] returns all the elements except for the second. It's also
possible to use a logical vector to indicate which elements to retrieve.


car_name <- c("Honda", "BMW", "Ferrari")


#Select 1st and 2nd index from the vector
car_name[c(1,2)]
[1] "Honda" "BMW"
#Select all except 2nd index
car_name[-2]
[1] "Honda"   "Ferrari"
#Select 2nd index
car_name[c(FALSE, TRUE, FALSE)]
[1] "BMW"


1.3.3.2 Lists 


Subsetting in lists is similar to subsetting in a vector; however, since a list is a collection
of many vectors, you must use double square brackets to retrieve an element from the list.
For example, cars[2] retrieves the entire second vector of the list and cars[[c(2,1)]]
retrieves the first element of the second vector.


cars <- list(name = c("Honda", "BMW", "Ferrari"),
             color = c("Black", "Blue", "Red"),
             cc = c(2000, 3400, 4000))


#Select the second list within cars
cars[2]
$color
[1] "Black" "Blue"  "Red"
#Select the first element of the second list in cars
cars[[c(2,1)]]
[1] "Black"


1.3.3.3 Matrixes 


Matrixes are subset much like vectors. However, instead of specifying one index to
retrieve the data, we need two indexes here: one that signifies the row and the other the
column. For example, mdat[1:2, ] retrieves all the columns of the first two rows, whereas
mdat[1:2, "C.1"] retrieves the first two rows and the C.1 column.


mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE,
               dimnames = list(c("row1", "row2"),
                               c("C.1", "C.2", "C.3")))


#Select first two rows and all columns
mdat[1:2, ]
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13
#Select first two columns and all rows
mdat[, 1:2]
     C.1 C.2
row1   1   2
row2  11  12
#Select first two rows and first column
mdat[1:2, "C.1"]
row1 row2
   1   11
#Select first row and first two columns
mdat[1, 1:2]
C.1 C.2
  1   2


1.3.3.4 Data Frames 


Data frames work similarly to matrixes, but they have far more advanced subsetting
operations. For example, it's possible to provide a condition like df$fac ==
"A", which will retrieve only the rows where the column fac has the value A. The operator $ is
used to refer to a column.


L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
df <- data.frame(x = 1, y = 1:10, fac = fac)


#Select all the rows where fac column has a value "A"
#(fac is sampled at random, so the selected rows vary from run to run)
df[df$fac == "A", ]
   x  y fac
2  1  2   A
...
10 1 10   A
#Select first two rows and all columns
df[c(1,2), ]
  x y fac
1 1 1   B
2 1 2   A
#Select first column as a vector
df$x
 [1] 1 1 1 1 1 1 1 1 1 1


1.3.4 Functions and Apply Family 


As the standard definition goes, functions are the fundamental building blocks of any
programming language, and R is no different. Every library in R has a rich set of
functions used to achieve a particular task without writing the same piece of code repeatedly.
All that is required is a function call. The following simple example is a function
with two arguments, num and nroot, whose body calculates the nth root of a
positive real number.

nthroot <- function(num, nroot) {
  # The nth root is the number raised to the power 1/nroot
  return(num^(1/nroot))
}


nthroot(8, 3)
[1] 2


This example is a user-defined function, but there are many such functions across
the vast collection of packages contributed by the R community worldwide. We will next
discuss a very useful family of functions from the base package of R, which has found
application in numerous scenarios.

The following description and examples are borrowed from The New S Language by 
Becker, R. A. et al. [11] 


• lapply returns a list of the same length as the input X, each
element of which is the result of applying a function to the
corresponding element of X.


• sapply is a user-friendly version and wrapper of lapply, by
default returning a vector, matrix or, if you use simplify = "array",
an array if appropriate, by applying simplify2array(). sapply(x,
f, simplify = FALSE, USE.NAMES = FALSE) is the same as
lapply(x, f).


• vapply is similar to sapply, but has a prespecified type of return
value, so it can be safer (and sometimes faster) to use.


• tapply applies a function to each cell of a ragged array, that is, to
each (non-empty) group of values given by a unique combination
of the levels of certain factors.


#Generate some data into a variable x
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE, FALSE, FALSE, TRUE))


#Compute the mean of each list element using lapply
lapply(x, mean)

$a
[1] 5.5


$beta
[1] 4.535125


$logic
[1] 0.5


#Compute the quantiles (0%, 25%, 50%, 75%, and 100%) for the three elements of x
sapply(x, quantile)
         a        beta logic
0%    1.00  0.04978707   0.0
25%   3.25  0.25160736   0.0
50%   5.50  1.00000000   0.5
75%   7.75  5.05366896   1.0
100% 10.00 20.08553692   1.0


#Generate a list of vectors using sapply on a sequence of integers
i39 <- sapply(3:9, seq) # list of vectors


#Compute the five-number summary using sapply and vapply with the
#function fivenum
sapply(i39, fivenum)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  1.0  1.0    1  1.0  1.0  1.0    1
[2,]  1.5  1.5    2  2.0  2.5  2.5    3
[3,]  2.0  2.5    3  3.5  4.0  4.5    5
[4,]  2.5  3.5    4  5.0  5.5  6.5    7
[5,]  3.0  4.0    5  6.0  7.0  8.0    9
vapply(i39, fivenum, c(Min. = 0, "1st Qu." = 0, Median = 0, "3rd Qu." = 0, Max. = 0))
        [,1] [,2] [,3] [,4] [,5] [,6] [,7]
Min.     1.0  1.0    1  1.0  1.0  1.0    1
1st Qu.  1.5  1.5    2  2.0  2.5  2.5    3
Median   2.0  2.5    3  3.5  4.0  4.5    5
3rd Qu.  2.5  3.5    4  5.0  5.5  6.5    7
Max.     3.0  4.0    5  6.0  7.0  8.0    9


#Generate 5 random numbers from a binomial distribution (size = 32, prob = 0.4);
#repeated values are allowed, and the values differ from run to run
groups <- as.factor(rbinom(n = 5, size = 32, prob = 0.4))


#Calculate the number of times each number repeats
tapply(groups, groups, length)
 7 11 12 13
 1  1  1  2

#The output is similar to that of the function table
table(groups)
groups
 7 11 12 13
 1  1  1  2


As you can see, every operation in this list involves logic that would otherwise need a
loop (for or while) to traverse the data. However, by using the apply family of
functions, we can keep the amount of code to a minimum and instead call a
single-line function with the appropriate arguments. It's functions like these that make R
a preferred programming language even for less experienced programmers.
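
To make the comparison concrete, the sketch below computes the same per-element means twice, once with an explicit for loop and once with sapply. The names means_loop and means_apply are our own; the list x is the one used in the examples above.

```r
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE, FALSE, FALSE, TRUE))

# Explicit loop: pre-allocate the result, then fill it element by element
means_loop <- numeric(length(x))
names(means_loop) <- names(x)
for (i in seq_along(x)) {
  means_loop[i] <- mean(x[[i]])
}

# The same computation as a single call
means_apply <- sapply(x, mean)

# Both approaches give the same named numeric vector
all.equal(means_loop, means_apply)
```
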


1.4 Machine Learning Process Flow


In the real world, every use case has a different modeling need, so it's hard to present a
very generic process flow that explains how you should build a machine learning model
or data product. However, it's possible to suggest best practice guidelines around the
key milestones of any modeling project in industry. Figure 1-7 depicts one such best
practice guideline, which we have built after many years of research and which suits the
contemporary world of data science, where ideas are translated into data products. Throughout this
book, we will refer to this machine learning process flow, as the topics in this book are
arranged coherently around it.


| PLAN | EXPLORE | BUILD | EVALUATE | 











Figure 1-7. Machine learning process flow


The process flow has four main phases, which we will from here on refer to as PEBE:
Plan, Explore, Build, and Evaluate, as shown in Figure 1-7. Let's get into the details of
each of these.


1.4.1 Plan 


This phase forms the key component of the entire process flow. A lot of energy and
effort needs to be spent on understanding the requirements, identifying every data
source at our disposal, and framing an approach for solving the problems
identified from the requirements. While gathering data is at the core of the entire process
flow, considerable effort has to be spent on cleaning the data to maintain the integrity
and veracity of the final outputs of the analysis and model building. We will discuss many
approaches for gathering various types of data and cleaning them up in Chapter 2.


1.4.2 Explore 


Exploration sets the ground for analytic projects to take flight. Possibilities, insights,
scope, hidden patterns, challenges, and errors in the data are first
discovered in this phase. A lot of statistical and visualization tools are employed to carry
out this phase. In order to allow for greater flexibility for modification if required in later
parts of the project, this phase is divided into two parts. The first is a quick initial analysis
that's carried out to assess the data structure, including checking naming conventions,
identifying duplicates, merging data, and further cleaning the data if required. Initial data
analysis will help identify any additional data requirements, which is why you see a small
feedback loop built into the process flow.

In the second part, a more rigorous analysis is done by creating hypotheses, 
sampling data using various techniques, checking the statistical properties of the sample, 
and performing statistical tests to reject or accept the hypotheses. Chapters 2, 3, and 4 
discuss these topics in detail. 


1.4.3 Build 


Most analytic projects die out in the first or second phase; however, one
that reaches this phase has great potential to be converted into a data product. This
phase requires a careful study of whether a machine learning model is required
or whether the simple descriptive analysis done in the first two phases is more than sufficient. In
industry, unless you show an ROI on the effort, time, and money required to build an
ML model, approval from management is hard to come by. And since many ML
algorithms are something of a black box whose output is at times difficult to interpret, the
business may reject them outright at the very beginning.

So, if you pass all these criteria and still decide to build the ML model, then comes the
time to understand the technicalities of each algorithm and how it works on a particular
set of data, which we will take up in Chapter 6. Once the model is built, it's always good
to ask if the model agrees with your findings in the initial data analysis. If not, then it's
advisable to take the small feedback loop.

One reason you see Build Data Product in the process flow before the evaluation
phase is to have a minimal viable output directed toward building a data product (not a
full-fledged product; it could even be a small Excel sheet presenting all the analysis
done until this point). We are not suggesting that you always build an ML
model; it could even be a descriptive model that articulates the way you approached
the problem and presents the analysis. This approach helps the evaluation phase decide
whether the model is good enough to be considered for building a more futuristic
predictive model (or a data product) using ML, whether there is still scope for
refinement, or whether it should be dropped completely.


1.4.4 Evaluate 


This phase determines either the rise of another revolutionary disruption in the
traditional scheme of things or the disappointment of starting from scratch once again.
The big feedback loop is sometimes unavoidable in real-world projects
because of the complexity they carry or the inability of the data to answer certain questions. If
you have diligently followed all the steps in the process flow, it's likely that you will just
want to spend some further effort tuning the model rather than taking the big leap of
starting all over from scratch.

It’s highly unlikely that you can build a powerful ML model in just one iteration. We 
will explore in detail all the criteria for evaluating the model’s goodness in Chapter 7 and 
further fine-tune the model in Chapter 8. 


1.5 Other Technologies


While we place a lot of emphasis on the key role played by programming languages
and technologies like R in simplifying many ML process flow tasks that would otherwise
be complex and time consuming, it would not be wise to ignore the other competing
technologies in the same space. Python is another preferred programming language that
has found good traction in the industry for building production-ready ML process
flows. There is an increased demand for algorithms and technologies capable
of scaling ML models or analytical tasks to much larger datasets and executing them at
real-time speed. The latter needs a much more detailed discussion of big data and
related technologies, which is beyond the scope of this book.

Chapter 9, in a nutshell, will talk about such scalable approaches and other
technologies that can help you build the same ML process flows robustly and
using industry standards. However, do remember that every approach or technology has
its own pros and cons, so wisely making the right choice before the start of any analytic
project is vital for its successful completion.


1.6 Summary 


In this chapter, you learned about the evolution of machine learning from statistics to
contemporary data science. We also looked at fundamental subjects like probability
and statistics, which form the foundations of ML. You had an introduction to the R
programming language, with some basic demonstrations in R. We concluded the chapter
with the machine learning process flow, the PEBE framework.

In the coming chapters, we will go into the details of data exploration for a better 
understanding and take a deep dive into some real-world datasets. 


CHAPTER 2


Data Preparation and 
Exploration 


As we emphasized in our introductory chapter on applying machine learning (ML) 
algorithms with a simplified process flow, in this chapter, we go deeper into the first block 
of machine learning process flow—data exploration and preparation. 

The subject of data exploration was very formally introduced by John W. Tukey 
almost four decades ago with his book on Exploratory Data Analysis (EDA). The methods 
discussed in the book were profound and there aren’t many software programs that 
include all of it. Tukey put forth certain very effective ways for exploring data that 
could prove very vital in understanding the data before building the machine learning 
models. There are a wide variety of books, articles, and software codes that explain data 
exploration, but we will focus our attention on techniques that help us look at the data 
with more granularity and bring useful insights to aid us in model building. Tukey defined 
data analysis in 1961 as: 


Procedures for analyzing data, techniques for interpreting the results 
of such procedures, ways of planning the gathering of data to make its 
analysis easier, more precise or more accurate, and all the machinery and 
results of (mathematical) statistics which apply to analyzing data.[1] 


We will decode this entire definition in detail throughout this chapter, but essentially,
data exploration at large involves looking at the statistical properties of data and, wherever
possible, drawing some very appealing visualizations to reveal certain not-so-obvious
patterns. In a broad sense, calculating statistical properties of data and visualization go
hand-in-hand, but we have tried to give them separate attention in order to bring out the best
of both. Moreover, this chapter will go beyond data exploration and cover the various
techniques available for making the data more suitable for analysis and modeling,
which include imputation of missing data, removing outliers, and adding derived
variables. This data preparation procedure is normally called initial data analysis (IDA).

This chapter also explores the process of data wrangling to prepare and transform
the data. Once the data is structured, we can think about various descriptive statistics
that explain the data more insightfully. In order to build the basic vocabulary for
understanding the language of data, we first discuss the basic types of variables,
data formats, and the degree of cleanliness. Then the entire data wrangling process
will be explained, followed by descriptive statistics. The chapter ends with demonstrations
using R. The examples help in seeing the theories take practical shape with real-world
data.

Broadly speaking, the chapter will focus on IDA and EDA. Even though we have a 
chapter dedicated to data visualization, which plays a pivotal role in understanding the 
data, EDA will give visualization its due emphasis in this chapter. We attempt to decode 
Tukey's definition of data analysis with a contemporary view. 


2.1 Planning the Gathering of Data 


Data in the real world comes in numerous types and formats. It could be structured
or unstructured, readable or obfuscated, and small or big; however, having a good plan
for data gathering that keeps the end goal in mind will prove beneficial and will save
a lot of time during data analysis and predictive modeling. Such a plan needs to include a
lot of information about variable types, data formats, and sources of data. In this section, we
describe many fundamentals for understanding the types, formats, and sources of
data.

A lot of data nowadays is readily available, but a truly data-driven industry will always
have a strategic plan for making sure the data is gathered the way it wants. Ideas from
Business Intelligence (BI) can help in designing data schemas, cubes, and many insightful
reports, but our focus is on laying out a very general framework, from understanding the
nuances of datatypes to identifying the sources of data.


2.1.1 Variable Types


In general, we have two basic types of variables in any given data, categorical and 
continuous. Categorical variables include the qualitative attributes of the data such as 
gender or country name. Continuous variables are quantitative, for example, the salary of 
employees in a company. 


2.1.1.1 Categorical Variables 


Categorical variables can be classified into Nominal, Dichotomous, and Ordinal.
We explain each type in a little more detail.


• Nominal


These are variables with two or more categories without any
regard for ordering. Examples in polling data from a survey are
the variables state and candidate name. The number of states
and candidates is definite, and it doesn't matter in what order we
choose to present our data. In other words, the order of state or
candidate name has no bearing on its relative importance in
explaining the data.


• Dichotomous


A special case of nominal variables with exactly two categories,
such as gender, the possible outcomes of a single coin toss, a survey
questionnaire with a checkbox for telephone number as mobile or
landline, or the outcome of an election as win or loss (assuming no tie).


• Ordinal


Just like nominal variables, ordinal variables can have two or more categories,
with the added condition that the categories are
ordered. An example is a customer rating for a movie on Netflix or a
product on Amazon. The variable rating has a relative importance
on a scale of 1 to 5, 1 being the lowest rating and 5 the highest for
a movie or product by a particular customer.
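
In R, categorical variables are usually stored as factors, and ordinal variables as ordered factors. The following sketch uses made-up values to show the three kinds side by side; the variable names are our own.

```r
# Nominal: categories with no inherent order
state <- factor(c("Ohio", "Texas", "Ohio", "Utah"))
levels(state)

# Dichotomous: a nominal variable with exactly two categories
phone_type <- factor(c("mobile", "landline", "mobile"))
nlevels(phone_type)

# Ordinal: categories with a defined order (ratings on a 1 to 5 scale)
rating <- factor(c(3, 5, 1, 4), levels = 1:5, ordered = TRUE)
is.ordered(rating)

# Order comparisons are meaningful only for ordered factors
rating[1] < rating[2]
```
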


2.1.1.2 Continuous Variables 


Continuous variables are subdivided into Interval and Ratio:


• Interval


Interval variables are measured along a
continuous range and have a numerical value, but their zero point is arbitrary. For example,
the temperature in degrees Celsius or Fahrenheit is an interval
variable. Note here that a temperature of 0° C is not absolute
zero: 0° C still represents a certain amount of temperature
rather than meaning none or no measure.


• Ratio


In contrast, ratio variables, such as distance, mass, and height, have a true zero.
The name reflects the fact that you can meaningfully use the ratio of measurements.
So, for example, a distance of 10 meters is twice the distance of 5
meters. A value of 0 for a ratio variable means none or no measure.


2.1.2 Data Formats 


The expanding digital landscape and the diversity of software systems have led to a
plethora of file formats for encoding information or data in a computer file. Many data
formats are accepted as the gold standard for storing information and are widely used
independent of any particular software, yet many other formats are in use mainly
because of the popularity of a given software package. Moreover, many data formats
specific to scientific applications or devices are also available.

In this section, we discuss the commonly used data formats and demonstrate how to
read, parse, and transform the data using R. The basic datatypes in R described in
Chapter 1, such as vectors, matrices, data frames, lists, and factors, will be used
throughout this chapter for all demonstrations.


2.1.2.1 Comma-Separated Values


Comma-separated values (CSV) files, often saved with a .csv or .txt extension, are one
of the most widely used data exchange formats for storing tabular data containing many
rows and columns. Depending on the data source, the rows and columns have a particular
meaning associated with them. Typical information looks like the following example of
employee data in a company. In R, the read.csv function is widely used to read such
data. The argument sep specifies the delimiting character, and header takes TRUE or
FALSE, depending on whether the dataset contains the column names.


read.csv("employees.csv", header = TRUE, sep = ",")
   Code First.Name Last.Name Salary.US.Dollar.
1 15421       John     Smith             10000
2 15422      Peter      Wolf             20000
3 15423       Mark   Simpson             30000
4 15424      Peter    Buffet             40000
5 15425     Martin    Luther             50000
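

The effect of header and of the automatic name cleaning can be checked without any external file. The sketch below writes a small, made-up CSV to a temporary file (the path and its contents are illustrative assumptions, not the employees.csv used in this chapter) and reads it back twice.

```r
# Write a small illustrative CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Code,First Name,Salary(US Dollar)",
             "15421,John,10000",
             "15422,Peter,20000"), csv_path)

# header = TRUE: the first line holds the column names; sep = ",": comma-delimited
emp_demo <- read.csv(csv_path, header = TRUE, sep = ",")
names(emp_demo)  # names are made syntactically valid while reading

# check.names = FALSE keeps the column names exactly as they appear in the file
emp_raw <- read.csv(csv_path, header = TRUE, sep = ",", check.names = FALSE)
names(emp_raw)
```

This is why the printed column names in this chapter contain dots (First.Name, Salary.US.Dollar.) instead of spaces and parentheses.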


2.1.2.2 Microsoft Excel 


The Microsoft Excel file format (.xls or .xlsx) has long been one of the most popular
data file formats in the business world. Its primary purpose is the same as that of CSV
files, but Excel files also offer rich mathematical computations and elegant data
presentation capabilities. Excel features calculation, graphing tools, pivot tables, and
a macro programming language called Visual Basic for Applications. Many industries have
used Excel's programming features to automate their data analysis and manual
calculations. This wide traction has resulted in many data analysis software programs
providing an interface to read Excel data. There are many ways to read an Excel file in
R; a convenient one is to use the xlsx package.


library(xlsx)
read.xlsx("employees.xlsx", sheetName = "Sheet1")
   Code First.Name Last.Name Salary.US.Dollar.
1 15421       John     Smith             10000
2 15422      Peter      Wolf             20000
3 15423       Mark   Simpson             30000
4 15424      Peter    Buffet             40000
5 15425     Martin    Luther             50000


2.1.2.3 Extensible Markup Language: XML 


Markup languages have a rich history, beginning with their first usage by William W.
Tunnicliffe in 1967 for a presentation at a conference. Later, Charles Goldfarb
formalized the IBM Generalized Markup Language between 1969 and 1973. Goldfarb is
commonly regarded as the father of markup languages. Markup languages have taken many
different forms, including TeX, HTML, XML, and XHTML, and are constantly being improved
to suit numerous applications. The basis of all these markup languages is a system for
annotating a document with a specific syntactic structure. Adhering to a markup
language while creating documents ensures that the syntax is not violated and that any
human or software reader knows exactly how to parse the data in a given document. This
feature has found a wide range of uses, from configuration files for setting up
software on a machine to communications protocols.

Our focus here is the Extensible Markup Language, widely known as XML. There
are two basic constructs in any markup language: the first is markup and the second is
the content. Generally, strings that form markup either start with the symbol < and
end with a >, or start with the character & and end with a ;. Strings other than these
are generally the content. There are three important markup types: tag,
element, and attribute.


• Tags


A tag is a markup construct that begins with < and ends with >.
There are three types:


- Start tags: <employee_id>
- End tags: </employee_id>
- Empty tags: <employee_id/>


• Elements


Elements are components that either begin with a start tag
and end with a matching end tag, or consist of only an
empty-element tag. Examples of elements:


<employee_id>John</employee_id>
<employee_name type="permanent"/>


• Attributes


Within a start tag or an empty-element tag, an attribute is a markup
construct consisting of a name/value pair. In the following
example, the element designation has two attributes, emp_id and
emp_name.


<designation emp_id="15421" emp_name="John">
Assistant Manager </designation>


<designation emp_id="15422" emp_name="Peter">
Manager </designation>


Consider the following example XML file storing information on athletes in a
marathon.


<marathon>
  <athletes>
    <name>Mike</name>
    <age>25</age>
    <awards>
      Two times world champion. Currently, worlds No. 3
    </awards>
    <titles>6</titles>
  </athletes>
  <athletes>
    <name>Usain</name>
    <age>29</age>
    <awards>
      Five times world champion. Currently, worlds No. 1
    </awards>
    <titles>17</titles>
  </athletes>
</marathon>


Using the packages XML and plyr in R, you can convert this file into a data.frame
as follows:


library(XML)
library(plyr)


xml_data <- xmlToList("marathon.xml")


#Excluding "description" from print
ldply(xml_data, function(x) { data.frame(x[!names(x) == "description"]) })
       .id  name age
1 athletes  Mike  25
2 athletes Usain  29
                                                   awards titles
1  \n Two times world champion. Currently, worlds No. 3\n      6
2 \n Five times world champion. Currently, worlds No. 1\n     17


2.1.2.4 Hypertext Markup Language: HTML 


Hypertext Markup Language, commonly known as HTML, is used to create web
pages. An HTML page, when combined with Cascading Style Sheets (CSS), can produce
beautiful static web pages. The web page can be made dynamic and interactive
by embedding a script written in a language such as JavaScript. Today's modern web sites
are a combination of HTML, CSS, JavaScript, and many more advanced technologies like
Flash players. Depending on the purpose, a web site could be made rich with all these
elements. Even as modern web sites grow more sophisticated, the core
HTML design of web pages has stood the test of time while gaining newer features and
advanced functionality. Although HTML is now filled with rich style and elegance, the
content still remains central.

The ever-exploding number of web sites has made it difficult to find
relevant content on a particular subject of interest, and that's where companies like
Google saw a need to crawl, scrape, and rank web pages for relevant content. This process
generates a lot of data, which Google uses to help users with their search queries. This
kind of process exploits the fact that HTML pages have a distinctive structure for storing
content and reference links to external web pages, if any.

There are five key elements in an HTML file that are scanned by the majority of web
crawlers and scrapers:


• Headers


A simple header might look like


<head>
<title>Machine Learning with R</title>
</head>


• Headings


There are six heading tags, h1 to h6, with decreasing font sizes.
They look like


<h1>Header</h1>
<h2>Headings</h2>
<h3>Tables</h3>
<h4>Anchors</h4>
<h5>Links</h5>
<h6>Links</h6>


• Paragraphs


A paragraph can hold anything from a few words to several sentences in a
single block.


<p>Paragraph 1</p>
<p>Paragraph 2</p>


• Tables


Tabular data of rows and columns can be embedded into
HTML tables with the following tags


<table>
  <thead>
    <tr>Define a header row</tr>
  </thead>
  <tbody>
    <tr>Define a row</tr>
  </tbody>
</table>

The <table> tag declares the main table structure.
The <thead> tag defines the table header.
The <tbody> tag specifies the body of the table.
The <tr> tag defines each row of data in the table.


• Anchor


Designers can use anchors to attach a URL to some text on a web
page. When users view the web page in a browser, they can click
the text to activate the link and visit the page. Here's an example


<a href="https://www.apress.com/">Welcome to Machine Learning using R!</a>

With all these elements, a sample HTML file will look like the following snippet.


<!DOCTYPE html>
<html>
<body>
<h1>Machine Learning using R</h1>
<p>Hope you are having a fun time reading this book!</p>
<h1>Chapter 2</h1>
<h2>Data Exploration and Preparation</h2>
<a href="https://www.apress.com/">Apress Website Link</a>
</body>
</html>


Python is one of the most powerful scripting languages for building web-scraping tools.
R also provides packages for the same job, although they are generally considered less
robust. Like Google, you can try to build a web-scraping tool that extracts all the
links from an HTML web page like the one shown previously. We show a very basic example
of reading a local HTML file named html_example.html, containing the previous snippet,
in R. It's easy to extend this to any web page.


library(XML)
url <- "html_example.html"
doc <- htmlParse(url)
xpathSApply(doc, "//a/@href")
                     href
"https://www.apress.com/"


2.1.2.5 JSON 


JSON (JavaScript Object Notation) is a widely used data interchange format in many
application programming interfaces, such as the Facebook Graph API, Google Maps, and
the Twitter API.

Data returned by the Facebook API is a typical example of JSON. The format can
also hold profile information that is easily shared across your system
components.


{
  "data": [
    {
      "id": "A1_B1",
      "from": {
        "name": "Jerry", "id": "G1"
      },
      "message": "Hey! Hope you like the book so far",
      "actions": [
        {
          "name": "Comment",
          "link": "http://www.facebook.com/A1/posts/B1"
        },
        {
          "name": "Like",
          "link": "http://www.facebook.com/A1/posts/B1"
        }
      ],
      "type": "status",
      "created_time": "2016-08-02T00:24:41+0000",
      "updated_time": "2016-08-02T00:24:41+0000"
    },
    {
      "id": "A2_B2",
      "from": {
        "name": "Tom", "id": "G2"
      },
      "message": "Yes. Easy to understand book",
      "actions": [
        {
          "name": "Comment",
          "link": "http://www.facebook.com/A2/posts/B2"
        },
        {
          "name": "Like",
          "link": "http://www.facebook.com/A2/posts/B2"
        }
      ],
      "type": "status",
      "created_time": "2016-08-03T21:27:44+0000",
      "updated_time": "2016-08-03T21:27:44+0000"
    }
  ]
}


Using the library rjson, you can read such JSON files into R and convert the data into
a data.frame. The following R code displays the first three columns of the data.frame.


library(rjson)
url <- "json_fb.json"
document <- fromJSON(file = url, method = 'C')
as.data.frame(document)[, 1:3]
  data.id data.from.name data.from.id
1   A1_B1          Jerry           G1


2.1.2.6 Other Formats 


Apart from all these widely used data formats, there are many more formats supported 
by R. It’s possible to read data directly from databases using ODBC connections. R also 
supports data formats from other data analytics software like SPSS, SAS, Stata, and 
MATLAB. 


2.1.3 Data Sources 


Depending on the source of the data, the format can vary. At times, identifying the type
and format of data is not very straightforward, but broadly speaking, some data has
a clear structure, some is semi-structured, and some might look like total
junk. Data gathering at large is not just an engineering effort but a great skill.


2.1.3.1 Structured 


Structured data is everywhere, and it is the easiest kind of data to understand,
represent, store, query, and process. In an ideal world, all data would have rows
and columns stored in a tabular manner. The widespread development of
business applications, database technology, business intelligence systems, and
spreadsheet tools has given rise to an enormous amount of clean, good-looking data.
Every row and column is well defined within connected schematic tables in a database.
Data coming from CSV and Excel files generally has this structure built in. We will
show many examples of such data throughout the book to explain the relevant concepts.


2.1.3.2 Semi-Structured 


Although structured data gives us plenty of scope to experiment and is easy to use, it's
not always possible to represent information in rows and columns. The kind of data
generated by Twitter and Facebook has moved significantly away from the traditional
Relational Database Management System (RDBMS) paradigm, where everything has a predefined
schema, to a world of NoSQL (Chapter 9 covers some of the NoSQL systems), where data
is semi-structured. Both Twitter and Facebook rely heavily on JSON or BSON (Binary
JSON). Databases like MongoDB and Cassandra store this kind of NoSQL data.


2.1.3.3 Unstructured 


The biggest challenge in data engineering has been dealing with unstructured data like
images, videos, web logs, and click streams. The challenge lies largely in handling the
volume and velocity of this data, on top of the difficulty of finding any patterns in it.
Despite many sophisticated software systems for handling such data, there are
no defined processes for using it in modeling or insight generation. Unlike
semi-structured data, where we have many APIs and tools to process the data into a
required format, here a huge effort is spent processing the data into a structured form.
Big data technologies like Hadoop and Spark are often deployed for such purposes. There
has been significant work on unstructured textual data generated from human interactions,
for instance, Twitter sentiment analysis on tweets (covered in Chapter 6).


2.2 Initial Data Analysis (IDA) 


Collection of data is the first mammoth task in any data science project, and it forms
the first building block of the machine learning process flow presented in Chapter 1.
Once the data is ready, then comes what we call the primary investigation or, more
formally, Initial Data Analysis (IDA). IDA makes sure our data is clean, correct, and
complete for further exploratory data analysis (EDA). The process of IDA involves
preparing the data with the right naming conventions and datatypes for the variables,
checking for missing and outlier values, and merging data from multiple sources to
develop one coherent data source for further EDA. IDA is commonly referred to as data
wrangling.

It's widely believed that data wrangling consumes a significant amount of time
(approximately 50-80% of the effort), and it's something that can't be overlooked. More
than a painful activity, data wrangling is a crucial step in generating understanding and
insights from data. It's not merely a process of cleaning and transforming the data; it
also helps to enrich and validate the data before something serious is done with it.

There are many thought processes around data wrangling; we will explain them 
broadly with demonstrations in R. 


2.2.1 Discerning a First Look 


The process of wrangling starts with a judicious and very shrewd look at your data from 
the start. This first look builds the intuition and understanding of patterns and trends. 
There are many useful functions in R that help you get a first grasp of your data in a quick 
and clear way. 


2.2.1.1 Function str() 


The str() function in R comes in very handy when you first look at the data. Referring
back to the employee data we used earlier, the str() output will look something like
what's shown in the following R code snippet. The output shows four useful pieces of
information about the data:


• The number of rows and columns in the data
• The variable name or column header of each column
• The datatype of each variable
• Sample values for each variable


Depending on how many variables the data contains, spending a few minutes
or an hour on this output will provide significant understanding of the entire dataset.


# In R 4.0.0 and later, add stringsAsFactors = TRUE to read.csv() to get the
# Factor columns shown below; otherwise the name columns are read as chr
emp <- read.csv("employees.csv", header = TRUE, sep = ",")


str(emp)
'data.frame':   5 obs. of  4 variables:
 $ Code             : int  15421 15422 15423 15424 15425
 $ First.Name       : Factor w/ 4 levels "John","Mark",..: 1 4 2 4 3
 $ Last.Name        : Factor w/ 5 levels "Buffet","Luther",..: 4 5 3 1 2
 $ Salary.US.Dollar.: int  10000 20000 30000 40000 50000


2.2.1.2 Naming Convention: make.names() 


To stay consistent with variable names throughout the analysis
and preparation phase, it's important that the variable names in the dataset follow the
standard R naming conventions. This step is critical for two reasons:


• When merging multiple datasets, it's convenient if the
common columns on which the merge happens have the same
variable name.


• It's good practice in any programming language to have
clean names (no spaces or special characters) for the variables.


R provides the make.names() function for this purpose. To demonstrate it, let's make our
variable names dirty and then use make.names() to clean them up. Note that the read.csv
function cleans the variable names by default before it loads the data into a
data.frame. But when we perform many operations on data inside our program, it's
possible for the variable names to fall out of convention.


#Manually overriding the naming convention
names(emp) <- c('Code', 'First Name', 'Last Name', 'Salary(US Dollar)')


# Look at the variable name
emp
   Code First Name Last Name Salary(US Dollar)
1 15421       John     Smith             10000
2 15422      Peter      Wolf             20000
3 15423       Mark   Simpson             30000
4 15424      Peter    Buffet             40000
5 15425     Martin    Luther             50000


# Now let's clean it up using make.names
names(emp) <- make.names(names(emp))


# Look at the variable name after cleaning
emp
   Code First.Name Last.Name Salary.US.Dollar.
1 15421       John     Smith             10000
2 15422      Peter      Wolf             20000
3 15423       Mark   Simpson             30000
4 15424      Peter    Buffet             40000
5 15425     Martin    Luther             50000
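

The rules make.names() applies can be seen on a plain character vector, without a data frame. The names in this sketch are made up for illustration; the unique argument is part of base R's make.names().

```r
# Illustrative column names containing spaces, symbols, a leading digit, and a duplicate
dirty <- c("First Name", "Salary(US Dollar)", "2016 bonus", "Code", "Code")

# Invalid characters become "."; names starting with a digit get an "X" prefix
make.names(dirty)

# unique = TRUE additionally disambiguates repeated names
clean <- make.names(dirty, unique = TRUE)
clean
```

Passing unique = TRUE is worth considering before a merge, since duplicated column names make later column selection ambiguous.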


2.2.1.3 Table(): Pattern or Trend 


Another reason it's important to look at your data closely up front is to spot any
anomaly or pattern in the data. Suppose we want to see whether there are any duplicates
in the employee data, or to find the most common name among employees for
a fun HR activity. Both tasks are possible using the table() function, whose
basic role is to show the frequency distribution in a one- or two-way tabular format.


#Find duplicates
table(emp$Code)

15421 15422 15423 15424 15425
    1     1     1     1     1


#Find common names
table(emp$First.Name)

  John   Mark Martin  Peter
     1      1      1      2


This clearly shows that there are no duplicates and that the name Peter appears twice.
This kind of pattern can be very useful for judging whether the data is biased in some
variables, which ties back to the final story we want to write from the analysis.
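

With more than a handful of rows, reading the table() output by eye becomes impractical. The following sketch, on made-up vectors (first_name, code, and the dept column are illustrative assumptions), shows how the same checks can be done programmatically and what a two-way table looks like.

```r
# Illustrative employee first names and codes
first_name <- c("John", "Peter", "Mark", "Peter", "Martin")
code <- c(15421, 15422, 15423, 15424, 15425)

# anyDuplicated() returns 0 when every value is unique
anyDuplicated(code)

# Sort the frequency table so the most common name comes first
name_counts <- sort(table(first_name), decreasing = TRUE)
names(name_counts)[1]

# Two-way table: first name against a hypothetical department column
dept <- c("HR", "IT", "IT", "HR", "IT")
table(first_name, dept)
```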


2.2.2 Organizing Multiple Sources of Data into One 


Often the data for our problem statement doesn't come from one place. The plethora of
resources and abundance of information in the world always keep us wondering whether
data is missing from whatever collection is available so far. We call it a tradeoff
between the abundance of data and our requirements. Not all data is useful and not all
our requirements will be met. So, once you believe no more data collection is possible,
the question becomes how to combine all that you have into one
single source of data. This process can be iterative, in the sense that something may
need to be added or deleted based on relevance.


2.2.2.1 Merge and dplyr Joins 


The most useful operation while preparing the data is the ability to join or merge two
different datasets into a single entity. This idea is easy to relate to the various joins
of SQL queries. A standard text on SQL will explain the different forms of joins
in detail, so we focus on the two options available in R for joining two datasets:
the merge() function and the join functions of dplyr.

We will use another extended dataset of our employee example (emp_qual), where we have
the department and educational qualification, and merge it with the existing dataset of
employees. Let's see the four most common types of joins using merge and dplyr. Though
the output is the same, there are differences between the two implementations. The
merge() function merges two data frames by common columns or row names, or performs
other versions of database join operations, whereas dplyr provides a flexible grammar of
data manipulation focused on tools for working with data frames (hence the d in the
name). dplyr is also generally regarded as more efficient than merge.


2.2.2.1.1 Using merge 


Inner Join: Returns rows where a matching value for the variable Code is found in both
the emp and emp_qual data frames.


merge(emp, emp_qual, by = "Code")
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD


Left Join: Returns all rows from the first data frame even if a matching value for the 
variable Code is not found in the second. 


merge(emp, emp_qual, by = "Code", all.x = TRUE)
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD
4 15424      Peter    Buffet             40000    <NA>
5 15425     Martin    Luther             50000    <NA>


Right Join: Returns all rows from the second data frame even if a matching value for the
variable Code is not found in the first.


merge(emp, emp_qual, by = "Code", all.y = TRUE)
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD
4 15426       <NA>      <NA>                NA     PhD
5 15429       <NA>      <NA>                NA     Phd


Full Join: Returns all rows from the first and second data frames whether or not a
matching value for the variable Code is found.


merge(emp, emp_qual, by = "Code", all = TRUE)
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD
4 15424      Peter    Buffet             40000    <NA>
5 15425     Martin    Luther             50000    <NA>
6 15426       <NA>      <NA>                NA     PhD
7 15429       <NA>      <NA>                NA     Phd


Note in these outputs that if a match is not found, the corresponding values are filled
with NA, which is nothing but a missing value. We will discuss later in IDA how to deal
with such missing values.
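

To reproduce the joins without the CSV files, the two data frames can be rebuilt by hand. The sketch below takes its values from the printed outputs above (so the Phd spelling in the last row is deliberate) and checks the number of rows each join type returns.

```r
# Rebuild the two data frames from the values printed in the outputs above
emp <- data.frame(
  Code = c(15421, 15422, 15423, 15424, 15425),
  First.Name = c("John", "Peter", "Mark", "Peter", "Martin"),
  Last.Name = c("Smith", "Wolf", "Simpson", "Buffet", "Luther"),
  Salary.US.Dollar. = c(10000, 20000, 30000, 40000, 50000)
)
emp_qual <- data.frame(
  Code = c(15421, 15422, 15423, 15426, 15429),
  Qual = c("Masters", "PhD", "PhD", "PhD", "Phd")
)

# Row counts for each join type
nrow(merge(emp, emp_qual, by = "Code"))                # inner: 3
nrow(merge(emp, emp_qual, by = "Code", all.x = TRUE))  # left:  5
nrow(merge(emp, emp_qual, by = "Code", all.y = TRUE))  # right: 5
nrow(merge(emp, emp_qual, by = "Code", all = TRUE))    # full:  7

# Codes present in only one of the two data frames
setdiff(emp$Code, emp_qual$Code)  # 15424 15425
setdiff(emp_qual$Code, emp$Code)  # 15426 15429
```

Checking the row count before and after a join is a cheap way to catch unexpected duplicates or unmatched keys.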


2.2.2.1.2 dplyr


library(dplyr)


Inner Join: Returns rows where a matching value for the variable Code is found in
both the emp and emp_qual data frames.


inner_join(emp, emp_qual, by = "Code")
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD


Left Join: Returns all rows from the first data frame even if a matching value for the 
variable Code is not found in the second. 


left_join(emp, emp_qual, by = "Code")
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD
4 15424      Peter    Buffet             40000    <NA>
5 15425     Martin    Luther             50000    <NA>


Right Join: Returns all rows from the second data frame even if a matching value for the
variable Code is not found in the first.


right_join(emp, emp_qual, by = "Code")
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD
4 15426       <NA>      <NA>                NA     PhD
5 15429       <NA>      <NA>                NA     Phd


Full Join: Returns all rows from the first and second data frame whether or not a 
matching value for the variable Code is found. 


full_join(emp, emp_qual, by = "Code")
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD
4 15424      Peter    Buffet             40000    <NA>
5 15425     Martin    Luther             50000    <NA>
6 15426       <NA>      <NA>                NA     PhD
7 15429       <NA>      <NA>                NA     Phd


Note that the output of the merge and dplyr functions is exactly the same for the
respective joins. dplyr is syntactically more meaningful, but instead of one merge()
function, we now have four different functions. You can also see that merge and dplyr are
similar to implicit and explicit join statements in SQL queries, respectively.


2.2.3 Cleaning the Data 


The critical part of data wrangling is removing inconsistencies from the data, like missing 
values, and following a standard format in abbreviations. The process is a way to bring 
out the best quality of information from the data. 


2.2.3.1 Correcting Factor Variables 


Since R is case-sensitive, every categorical variable with a definite set of values,
like the variable Qual in our employee dataset with PhD and Masters as its two values,
needs to be checked for inconsistencies. In R, such variables are called factors, with
PhD and Masters as the two levels. Values like PhD and Phd are treated as different
levels, even though they mean the same thing. A manual inspection using the table()
function will reveal such patterns. One way to correct this would be:


employees_qual <- read.csv("employees_qual.csv")


#Inconsistent
employees_qual
   Code    Qual
1 15421 Masters
2 15422     PhD
3 15423     PhD
4 15426     PhD
5 15429     Phd


employees_qual$Qual <- as.character(employees_qual$Qual)
employees_qual$Qual <- ifelse(employees_qual$Qual %in% c("Phd", "phd", "PHd"),
                              "PhD", employees_qual$Qual)


#Corrected
employees_qual
   Code    Qual
1 15421 Masters
2 15422     PhD
3 15423     PhD
4 15426     PhD
5 15429     PhD
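

Listing every misspelling by hand does not scale when many variants exist. As an alternative sketch, not the approach used above, the values can be compared on a lowercased copy and mapped to one canonical spelling each. The qual vector and the canonical lookup are illustrative assumptions.

```r
# Illustrative qualification values with inconsistent capitalization
qual <- c("Masters", "PhD", "Phd", "phd", "PHd", "masters")

# Lookup from the lowercased value to the canonical spelling
canonical <- c(phd = "PhD", masters = "Masters")
qual_clean <- unname(canonical[tolower(qual)])

table(qual_clean)

# Convert to a factor once the values are consistent
qual_factor <- factor(qual_clean)
levels(qual_factor)
```

Any value missing from the lookup becomes NA under this approach, which conveniently flags unexpected categories for manual review.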


2.2.3.2 Dealing with NAs 


NAs (an abbreviation for "Not Available") are missing values. If we live with them until
the end, they can lead to wrong interpretations, exceptions in function output, and
failing models. There are two main ways to handle NAs: remove or ignore them if we are
sitting on a big pool of data, or, if we can't afford to lose anything from a precious
small dataset, impute them, which is the process of filling in the missing values.

The technique of imputation has attracted many researchers to devise novel ideas,
but nothing can beat the simplicity that comes from a complete understanding of the
data. Let's take an example from the merge we did previously, in particular the output
from the right join. It's not possible to impute the first and last names, but they might
not be relevant for any aggregate analysis we want to do on our data. Rather, the variable
that's important for us is Salary, where we don't want to see NA. So, here is how we
impute a value in the Salary variable.


emp <- read.csv("employees.csv")
employees_qual <- read.csv("employees_qual.csv")


#Correcting the inconsistency
employees_qual$Qual <- as.character(employees_qual$Qual)
employees_qual$Qual <- ifelse(employees_qual$Qual %in% c("Phd", "phd", "PHd"),
                              "PhD", employees_qual$Qual)


#Store the output from right_join in the variable impute_salary
impute_salary <- right_join(emp, employees_qual, by = "Code")


#Calculate the average salary for each Qualification
ave_age <- ave(impute_salary$Salary.US.Dollar., impute_salary$Qual,
               FUN = function(x) mean(x, na.rm = TRUE))


#Fill the NAs with the average values
impute_salary$Salary.US.Dollar. <- ifelse(is.na(impute_salary$Salary.US.Dollar.),
                                          ave_age, impute_salary$Salary.US.Dollar.)


impute_salary
   Code First.Name Last.Name Salary.US.Dollar.    Qual
1 15421       John     Smith             10000 Masters
2 15422      Peter      Wolf             20000     PhD
3 15423       Mark   Simpson             30000     PhD
4 15426       <NA>      <NA>             25000     PhD
5 15429       <NA>      <NA>             25000     PhD


Here, the idea is that employees with a particular qualification receive paychecks of a
similar value. This rests on an assumption: that the industry isn't biased in its
paychecks by which institution granted the employee's degree.
If there is a significant bias, the average might not be the right
measure; something like the median could be used instead. We will discuss these kinds of
bias in greater detail in the descriptive analysis section.
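

The difference between the two measures is easy to see on a small example. The sketch below uses made-up salaries (not the employee dataset) with one extreme PhD salary, and imputes the missing values with both the group mean and the group median.

```r
# Illustrative data: one extreme PhD salary and two missing values
salary <- c(10000, 20000, 30000, 200000, NA, NA)
qual   <- c("Masters", "PhD", "PhD", "PhD", "PhD", "Masters")

# Group-wise mean and median, ignoring missing values
grp_mean   <- ave(salary, qual, FUN = function(x) mean(x, na.rm = TRUE))
grp_median <- ave(salary, qual, FUN = function(x) median(x, na.rm = TRUE))

# Fill the NAs with each statistic to compare the imputed values
imputed_mean   <- ifelse(is.na(salary), grp_mean, salary)
imputed_median <- ifelse(is.na(salary), grp_median, salary)

imputed_mean[5]    # the PhD mean is pulled up by the 200000 salary
imputed_median[5]  # the PhD median stays at a typical PhD salary
```

In this example the mean imputes roughly 83,333 for the missing PhD salary, while the median imputes 30,000, which is closer to what most PhD holders in the sample earn.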


2.2.3.3 Dealing with Dates and Times 


In many models, date and time variables play a pivotal role. They
reveal a lot about temporal behavior. For instance, the sales data of a supermarket or
online store could give us details like the time of day when sales volume peaks,
the sales trend of weekdays versus weekends, and much more. Dealing with date
variables is often a painful task, primarily because of the many available date formats,
time zones, and daylight savings in some countries. These challenges make arithmetic
calculations, like finding the difference between days or comparing two date values,
even more difficult.

The lubridate package is one of the most useful packages in R for dealing with these
challenges. The paper "Dates and Times Made Easy with lubridate,"
published in the Journal of Statistical Software by Grolemund and Wickham, describes the
capabilities the package offers. To borrow from the paper, lubridate helps users:


• Identify and parse date-time data
• Extract and modify components of a date-time, such as years,
months, days, hours, minutes, and seconds
• Perform accurate calculations with date-times and timespans
• Handle time zones and daylight savings time


The paper gives an elaborate description with many examples, so here we take up
only two less common date transformations: dealing with time zones and with daylight
savings.


2.2.3.3.1 Time Zone 


If we want to convert a date and time labeled in Indian Standard Time (IST), the
local time standard of the system where the code was run, to Coordinated Universal Time
(UTC), we use the following code:


library("lubridate")
date <- as.POSIXct("2016-03-13 09:51:48")
date
[1] "2016-03-13 09:51:48 IST"
with_tz(date, "UTC")
[1] "2016-03-13 04:21:48 UTC"
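

The output above depends on the time zone of the machine running the code. As a sketch that does not rely on the system setting, the time zone can be stated explicitly using base R alone; "Asia/Kolkata" is the tz database name for IST, and this example assumes that database is available on the system.

```r
# State the time zone explicitly instead of relying on the system default
ist_time <- as.POSIXct("2016-03-13 09:51:48", tz = "Asia/Kolkata")

# Same instant, displayed in UTC (IST is UTC+05:30)
format(ist_time, tz = "UTC", usetz = TRUE)

# The underlying instant is unchanged; only the display differs
utc_time <- as.POSIXct("2016-03-13 04:21:48", tz = "UTC")
ist_time == utc_time
```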


2.2.3.3.2 Daylight Savings Time 


As the standard says, “daylight saving time (DST) is the practice of resetting the clocks
with the onset of summer months by advancing one hour so that evening daylight stays
an hour longer, while foregoing normal sunrise times.”

For example, in the United States, the one-hour shift occurs at 02:00 local time, 
so in the spring, the clock is reset to advance by an hour from the last moment of 01:59 
standard time to 03:00 DST. That day has 23 hours. Whereas in autumn, the clock is reset 
to go backward from the last moment of 01:59 DST to 01:00 standard time, repeating 
that hour, so that day has 25 hours. A digital clock will skip 02:00, exactly at the shift to 
summer time, and instead advance from 01:59:59.9 to 03:00:00.0. 


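dst_time <- ymd_hms("2010-03-14 01:59:59")
dst_time <- force_tz(dst_time, "America/Chicago")
dst_time

[1] "2010-03-14 01:59:59 CST"


One second later, Chicago clock times read:


dst_time + dseconds(1)
[1] "2010-03-14 03:00:00 CDT"


The force_tz() function keeps the clock time as it is and replaces the time zone with the
one we pass, so the result refers to a different instant. The with_tz() function used earlier
does the opposite: it keeps the instant and changes the clock time that is displayed.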
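As a cross-check that does not need lubridate, the same two ideas can be sketched in base R. This is only an illustration: the zone names assume that the IANA time zone database is available on the system, and the variable names are ours.

```r
# A base R sketch; zone names assume the system tz database is available.
ist <- as.POSIXct("2016-03-13 09:51:48", tz = "Asia/Kolkata")

# Same instant, shown on a UTC clock (the idea behind with_tz())
same_instant_utc <- format(ist, tz = "UTC", usetz = TRUE)

# Same clock reading, reinterpreted as UTC (the idea behind force_tz());
# this is a different instant, 5.5 hours later than the original
same_clock_utc <- as.POSIXct(format(ist, "%Y-%m-%d %H:%M:%S"), tz = "UTC")
shift_hours <- as.numeric(difftime(same_clock_utc, ist, units = "hours"))

# The spring-forward gap in Chicago: one second after 01:59:59 CST
before_gap <- as.POSIXct("2010-03-14 01:59:59", tz = "America/Chicago")
after_gap <- format(before_gap + 1, usetz = TRUE)

print(same_instant_utc)
print(shift_hours)
print(after_gap)
```

The first value reproduces the UTC conversion shown earlier, and the last one reproduces the jump from 01:59:59 CST to 03:00:00 CDT.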


2.2.4 Supplementing with More Information 


The best models are not built with the raw data available at the beginning, but come from the
intelligence shown in deriving new variables from existing ones. For instance, a date
variable from the sales data of a supermarket could help in building variables like weekend
(1/0), weekday (1/0), and bank holiday (1/0), and combining multiple variables like income
and population could lead to per capita income. Such creativity with derived variables
usually comes with a lot of experience and domain expertise. However, there are
some common approaches for standard variables such as dates, which are discussed in
detail here.


2.2.4.1 Derived Variables 


Deriving new variables requires a lot of creativity. Sometimes it is driven by a purpose:
a situation where a derived variable helps to explain certain behavior. For example,
while looking at the sales trend of an online store, we see a sudden surge in volume
on a particular day, and on further investigation we find the reason to be heavy
discounting for end-of-season sales. If we include a new binary variable EOS_Sales,
taking the value 1 if there was an end-of-season sale and 0 otherwise, we may help the model
understand why a sudden surge is seen in the sales.
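A minimal sketch of such derived flags is shown here. The sales data frame, its column names, and the end-of-season window are all made up for illustration.

```r
# Hypothetical daily sales; dates and units are invented for this sketch.
sales <- data.frame(
  date  = as.Date(c("2016-03-11", "2016-03-12", "2016-03-13", "2016-03-14")),
  units = c(120, 310, 295, 130)
)

# POSIXlt stores the day of the week as 0 (Sunday) to 6 (Saturday),
# which avoids locale-dependent day names.
wday <- as.POSIXlt(sales$date)$wday
sales$weekend <- as.integer(wday %in% c(0, 6))
sales$weekday <- 1L - sales$weekend

# Assumed end-of-season sale window (hypothetical dates)
eos_start <- as.Date("2016-03-12")
eos_end   <- as.Date("2016-03-13")
sales$EOS_Sales <- as.integer(sales$date >= eos_start & sales$date <= eos_end)

print(sales)
```

A bank holiday flag could be built the same way by matching the date column against a vector of holiday dates.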


2.2.4.2 n-day Averages 


Another useful technique for deriving such variables, especially in time series data from
the stock market, is to derive variables like last_7_days, last_2_weeks, and last_1_month
average stock prices. Such variables reduce the variability in data like stock
prices, which can sometimes look like noise and can hamper the performance of the
model to a great extent.
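One way to compute such a trailing average in base R is sketched here, using stats::filter() with equal weights. The price series and the window length n are invented for illustration.

```r
# Hypothetical daily closing prices
price <- c(10, 12, 11, 13, 15, 14, 16, 18, 17, 19)
n <- 3  # window length in days (an assumption for this sketch)

# Trailing n-day mean: the value on day t averages days t-n+1 to t.
# sides = 1 uses only current and past values; the first n-1 entries are NA.
last_n_avg <- as.numeric(stats::filter(price, rep(1 / n, n), sides = 1))

print(data.frame(price, last_n_avg))
```

Because only past values enter each average, the derived variable can be used as a predictor without leaking future information into the model.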


2.2.5 Reshaping 


In many modeling exercises, it’s common practice to reshape the data into a more
meaningful and usable format. Here, we show one example dataset from the World Bank on
World Development Indicators (WDI). The data has a wide set of variables describing
various attributes of development from the year 1950 until 2015. It is a very rich
and large dataset.

The following is a small sample of development indicators and their values for Zimbabwe
for the years 1995 and 1998:


library(data.table)

WDI_Data <- fread("WDI Data.csv", header = TRUE, skip = 333555, select = c(3, 40, 43))

setnames(WDI_Data, c("Dev_Indicators", "1995", "1998"))

WDI_Data <- WDI_Data[c(1, 3), ]


Development Indicators (DI):


WDI_Data[, "Dev_Indicators", with = FALSE]
                                             Dev_Indicators
1: Women's share of population ages 15+ living with HIV (%)
2:  Youth literacy rate, population 15-24 years, female (%)


DI Value for the years 1995 and 1998:


WDI_Data[, 2:3, with = FALSE]
       1995     1998
1: 56.02648 56.33425
2:       NA       NA


In this data, each row holds a development indicator and the columns hold its values
for the years 1995 and 1998. Now, using the package tidyr, we will reshape this
data to gather the columns 1995 and 1998 into one column called Year, with the corresponding
values in a column called Value. This transformation will come in handy when we look at
data visualization in Chapter 4.


library(tidyr)
gather(WDI_Data, Year, Value, 2:3)
                                             Dev_Indicators Year    Value
1: Women's share of population ages 15+ living with HIV (%) 1995 56.02648
2:  Youth literacy rate, population 15-24 years, female (%) 1995       NA
3: Women's share of population ages 15+ living with HIV (%) 1998 56.33425
4:  Youth literacy rate, population 15-24 years, female (%) 1998       NA


There are many such ways of reshaping our data, which we will describe as we look 
at many case studies throughout the book. 
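To make the wide-to-long idea concrete without any package, here is a small base R sketch on a made-up two-row table that mirrors the WDI sample. The indicator names are placeholders. In later tidyr releases, gather() is superseded by pivot_longer(), which performs the same kind of reshaping.

```r
# A made-up wide table: one row per indicator, one column per year
wide <- data.frame(
  Dev_Indicators = c("Indicator A", "Indicator B"),
  `1995` = c(56.02648, NA),
  `1998` = c(56.33425, NA),
  check.names = FALSE  # keep the year column names as they are
)

year_cols <- c("1995", "1998")

# Long format: repeat the identifier once per year column and
# stack the year columns on top of each other
long <- data.frame(
  Dev_Indicators = rep(wide$Dev_Indicators, times = length(year_cols)),
  Year  = rep(year_cols, each = nrow(wide)),
  Value = unlist(wide[year_cols], use.names = FALSE)
)

print(long)
```

The result has one row per indicator and year, in the same order as the gather() output above.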


2.3 Exploratory Data Analysis 


EDA provides a framework for choosing the appropriate descriptive methods for various data
analysis needs. Tukey, in his book Exploratory Data Analysis, emphasized the need to
focus more on suggesting hypotheses using data rather than getting involved in
repetitive statistical hypothesis testing. Hypothesis testing in statistics is a tool for making
confirmatory assertions drawn from data or, more formally, for statistically establishing
the significance of an insight. More on this in the next section. EDA provides both
visual and quantitative techniques for data exploration.

Tukey's EDA gave birth to two path-breaking developments in statistical theory:
robust statistics and non-parametric statistics. Both of these ideas had a big role in
redefining the way people perceived statistics. It is no longer a complicated bunch of
theorems and axioms but rather a powerful tool for exploring data. So, with our data in
the most desirable format after cleaning up, we are ready to dive deep into the analysis.

Let’s first take a simple example and understand these statistics. Consider a 
marathon of approximately 26 miles and the finishing times (in hours) of 50 marathon 
runners. There are runners ranging from world-class elite marathoners to first-timers 
who walk all the way. 
Table 2-1. A Snippet of This Data


ID   Type           Finishing Time
1    Professional   2.2
2    First-Timer    io
3    Frequents      4.3
4    Professional   2.3
5    Frequents      5.1
6    First-Timer    8.3


This dataset will be used throughout to explain the various exploratory analyses.


2.3.1 Summary Statistics 


In statistics, what we call “Summary Statistics” for any numerical variables in the dataset 
are the genesis for data exploration. These are Minimum, First Quartile, Median, Mean, 
Third Quartile, and Maximum. These numbers explain a great deal about the data. It’s 
easy to calculate all these in R using the summary function. 


marathon <- read.csv("marathon.csv")

summary(marathon)

       Id                  Type     Finish_Time
 Min.   : 1.00   First-Timer :17   Min.   :1.700
 1st Qu.:13.25   Frequents   :19   1st Qu.:2.650
 Median :25.50   Professional:14   Median :4.300
 Mean   :25.50                     Mean   :4.651
 3rd Qu.:37.75                     3rd Qu.:6.455
 Max.   :50.00                     Max.   :9.000

quantile(marathon$Finish_Time, 0.25)
 25%
2.65


For categorical variables, the summary function simply gives the count of each
category, as seen with the variable Type. For numerical variables, apart from the
minimum and maximum, which are quite straightforward to understand, we have the mean,
median, first quartile, and third quartile.


2.3.1.1 Quantile 


If we divide our population of data into four equal groups, based on the distribution of
values of a particular numerical variable, then the three values that separate the four
groups are called the first, second, and third quartiles. The more general term
is quantile: q-quantiles are values that partition a finite set of values into q subsets of
(nearly) equal size.

For instance, dividing into four equal groups gives the 4-quantiles (quartiles), which are
defined by three cut points. In terms of usage, percentile is the more widely used term;
it is a measure used in statistics indicating the value below which a given percentage of
observations in a group of observations falls. If we divide the data into 100 equal groups,
we have the 100-quantiles (percentiles), defined by 99 cut points, which leads us to define
the first quartile as the 25th percentile and the third quartile as the 75th percentile.
In simpler terms, the 25th percentile or first quartile is a value below
which 25 percent of the observations are found. Similarly, the 75th percentile or third quartile
is a value below which 75 percent of the observations are found.

First Quartile:


quantile(marathon$Finish_Time, 0.25)
 25%
2.65


Second Quartile or Median:


quantile(marathon$Finish_Time, 0.5)
50%
4.3

# Another function to calculate the median
median(marathon$Finish_Time)
[1] 4.3


Third Quartile:


quantile(marathon$Finish_Time, 0.75)
  75%
6.455


The interquartile range is the difference between the 75th percentile and the 25th
percentile; it is the range that contains the middle 50% of the data of any particular
variable in the dataset. The interquartile range is a robust measure of statistical
dispersion. We will discuss this further in the later part of the section.


quantile(marathon$Finish_Time, 0.75, names = FALSE) -
  quantile(marathon$Finish_Time, 0.25, names = FALSE)
[1] 3.805
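The relationship between quartiles, percentiles, and the interquartile range can be checked on a tiny vector. This sketch uses made-up values and R's default quantile definition (type 7); other definitions can give slightly different cut points on small samples.

```r
x <- 1:9  # made-up data

# Quartiles: three cut points that split the data into four groups
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)

# Percentiles: 99 cut points that split the data into 100 groups
percentiles <- quantile(x, probs = (1:99) / 100, names = FALSE)

# Interquartile range, by hand and with the built-in helper
iqr_manual  <- q[3] - q[1]
iqr_builtin <- IQR(x)

print(q)                 # 3 5 7
print(percentiles[25])   # the 25th percentile equals the first quartile
print(c(iqr_manual, iqr_builtin))
```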


2.3.1.2 Mean 


Though the median is a robust measure of the central tendency of any distribution of data,
the mean is a more traditional statistic for describing the data distribution.
The median is more robust because a single large observation can throw the mean off. We
formally define this statistic in the next section.


mean(marathon$Finish_Time)
[1] 4.6514
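A quick sketch with made-up numbers shows how a single extreme observation moves the mean far more than the median:

```r
times <- c(2, 3, 4, 5, 6)         # made-up finish times
with_outlier <- c(times, 100)     # one extreme value added

before <- c(mean = mean(times), median = median(times))
after  <- c(mean = mean(with_outlier), median = median(with_outlier))

print(before)  # mean 4, median 4
print(after)   # mean 20, median 4.5
```

The mean jumps from 4 to 20, while the median only moves from 4 to 4.5.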
As you would expect, the statistics listed in the summary function's output (the mean and
median are often counted together as measures of centrality) are in increasing order
of their values. For the quantiles, the reason is obvious from the way they are defined;
the mean always lies between the minimum and the maximum, although not necessarily between
the median and the third quartile. And if these
statistical definitions were difficult for you to contemplate, don't worry, we will turn to
visualization for explanation. Though we have a dedicated chapter on it, here we discuss
some very basic plots that are inseparable from theories around any exploratory analysis.


2.3.1.3 Frequency Plot


A frequency plot shows the number of runners of each type. Such simple plots explain
the distribution of a categorical variable. As seen in the plot, we see how many first-timers,
frequents, and professional runners participated in the marathon.


plot(marathon$Type, xlab = "Marathoners Type", ylab = "Number of Marathoners")


[Figure: bar chart; x-axis: Marathoners Type (First-Timer, Frequents, Professional); y-axis: Number of Marathoners]


Figure 2-1. Number of athletes in each type


2.3.1.4 Boxplot 


A boxplot is the visual counterpart of the summary statistics. Though looking
at numbers is always useful, an equivalent representation in a visually
appealing plot can serve as an excellent tool for better understanding, insight
generation, and ease of explaining the data.

In the summary, we saw the values for each variable but were not able to see how
the finish time varies for each type of runner, in other words, how type and finish time are
related. Figure 2-2 helps to illustrate this relationship. As expected, the boxplot clearly
shows that professionals have a much better finish time than frequents and first-timers.


boxplot(Finish_Time ~ Type, data = marathon, main = "Marathon Data",
        xlab = "Type of Marathoner", ylab = "Finish Time")
[Figure: boxplot titled "Marathon Data"; x-axis: Type of Marathoner (First-Timer, Frequents, Professional); y-axis: Finish Time]


Figure 2-2. Boxplot showing variation of finish times for each type of runner


2.3.2 Moment 


Apart from the summary statistics, we have other statistics like variance, standard 
deviation, skewness, kurtosis, covariance, and correlation. These statistics naturally lead 
us to look for some distribution in the data. 

More formally, we are interested in the quantitative measure called the moment. We
treat our data points as draws from a probability density function, which describes the
relative likelihood of a random variable taking on a given value. The random variables are
the attributes of our dataset. In the marathon example, we have the Finish_Time variable
describing the finishing time of each marathoner. For the probability density function, we are
interested in the first five moments:


• The zeroth moment is the total probability (i.e., one)
• The first moment is the mean


• The second central moment is the variance; its positive square
root is the standard deviation


• The third moment (with normalization) is the skewness
• The fourth moment (with normalization and shift) is the kurtosis


Let’s look at the second, third, and fourth moments in detail.

The literature on exploratory data analysis is so rich with the exemplary works of
J.W. Tukey that it’s very hard for his admirers not to refer to his work. So here is another
passage from his classic, The Future of Data Analysis:


We were together learning how to use the analysis of variance, and perhaps it is worth
while stating an impression that I have formed - that the analysis of variance, which may
perhaps be called a statistical method, because that term is a very ambiguous one - is not
a mathematical theorem, but rather a convenient method of arranging the arithmetic.
Just as in arithmetical textbooks - if we can recall their contents - we were given rules
for arranging how to find the greatest common measure, and how to work out a sum in
practice, and were drilled in the arrangement and order in which we were to put the figures
down, so with the analysis of variance; its one claim to attention lies in its convenience.


So, fundamentally, after mean, variance will form the basis for many other statistical 
methods to analyze and understand the data better. 


2.3.2.1 Variance 


Variance is a measure of the spread of a given set of numbers. The smaller the
variance, the closer the numbers are to the mean, and the larger the variance, the
farther away the numbers are from the mean. Variance is an important measure
for understanding the distribution of the data, more formally called the probability
distribution. In the next chapter, where various sampling techniques are discussed, we
examine how a sample variance is considered to be an estimate of the full population
variance, which forms the basis for a good sampling method. The definition of variance
depends on whether our variable is discrete or continuous.

Mathematically, for a set of n equally likely numbers for a discrete random variable,
variance can be represented as follows:


$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$


And more generally, if every number x_i in our distribution occurs with a probability p_i,
variance is given by:


$$\sigma^2 = \sum_{i=1}^{n} p_i \, (x_i - \mu)^2$$


As seen in the formula, for every data point we measure how far the number
is from the mean, which translates into a measure of spread. If we take the
square root of the variance, the resulting measure is called the standard deviation, generally
written as σ. The standard deviation has the same units as the data, which
makes it convenient to compare with the mean. Together, the mean and standard
deviation give a compact summary of a distribution of data. Let’s take a look at the
variance of the variable Finish_Time from our marathon data.


mean(marathon$Finish_Time)
[1] 4.6514

var(marathon$Finish_Time)
[1] 4.342155

sd(marathon$Finish_Time)
[1] 2.083784
Looking at the values of mean and standard deviation, we could say, on average, that
the marathoners have a finish time of 4.65 +/- 2.08 hours. Further, it’s easy to notice from
the following code snippet that each type of runner has their own speed of running and
hence a different finish time.


tapply(marathon$Finish_Time, marathon$Type, mean)
 First-Timer    Frequents Professional
    7.154118     4.213158     2.207143
tapply(marathon$Finish_Time, marathon$Type, sd)
 First-Timer    Frequents Professional
   0.8742358    0.5545774    0.3075068
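The two variance formulas given earlier can be computed directly. The sketch below uses made-up numbers. Note that R's var() and sd() divide by n - 1 (the sample variance), whereas the formulas above divide by n, so the two differ slightly on small datasets.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data with mean 5
n  <- length(x)
mu <- mean(x)

# Equally likely values: average squared distance from the mean
pop_var <- sum((x - mu)^2) / n    # 4
pop_sd  <- sqrt(pop_var)          # 2

# var() uses the n - 1 denominator
sample_var <- var(x)              # 32 / 7

# Values with unequal probabilities p_i (a made-up distribution)
vals <- c(1, 2, 3)
p    <- c(0.2, 0.5, 0.3)
mu_p  <- sum(p * vals)
var_p <- sum(p * (vals - mu_p)^2) # 0.49

print(c(pop_var = pop_var, sample_var = sample_var, var_p = var_p))
```

With 50 runners, as in the marathon data, the difference between the two denominators is small, but it is worth knowing which one a function uses.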


2.3.2.2 Skewness 


As variance is a measure of spread, skewness measures asymmetry about the mean of the 
probability distribution of a random variable. In general, as the standard definition says, 
we could observe two types of skewness: 


• Negative skew: The left tail is longer; the mass of the distribution
is concentrated on the right. The distribution is said to be left-skewed,
left-tailed, or skewed to the left.


• Positive skew: The right tail is longer; the mass of the distribution
is concentrated on the left. The distribution is said to be right-skewed,
right-tailed, or skewed to the right.


Mathematicians discuss skewness in terms of the second and third moments around
the mean. A more easily interpretable formula can be written using the standard deviation:


$$g_1 = \frac{\sum_{i=1}^{N} (x_i - \mu)^3 / N}{\sigma^3}$$


This formula for skewness is referred to as the Fisher-Pearson coefficient of skewness.
Many software programs actually compute the adjusted Fisher-Pearson coefficient of
skewness, which can be thought of as a correction of the coefficient for the sample size:


$$G_1 = \frac{\sqrt{N(N-1)}}{N-2} \cdot \frac{\sum_{i=1}^{N} (x_i - \mu)^3 / N}{\sigma^3}$$


Let’s use histograms of the beta distribution to demonstrate skewness.

library("moments")

Warning: package ‘moments’ was built under R version 3.2.3
par(mfrow = c(1, 3), mar = c(5.1, 4.1, 4.1, 1))
# Positive skew: Beta(2, 6) has its mass on the left and a longer right tail
hist(rbeta(10000, 2, 6), main = "Positive Skew")
skewness(rbeta(10000, 2, 6))
[1] 0.7166848
# Negative skew: Beta(6, 2) has its mass on the right and a longer left tail
hist(rbeta(10000, 6, 2), main = "Negative Skew")
skewness(rbeta(10000, 6, 2))
[1] -0.6375038
# Symmetrical
hist(rbeta(10000, 6, 6), main = "Symmetrical")
skewness(rbeta(10000, 6, 6))
[1] -0.03952911


Figure 2-3. Distribution showing symmetrical versus negative and positive skewness


For our marathon data, the skewness is fairly close to 0, indicating a roughly symmetrical
distribution with a slight positive skew.


hist(marathon$Finish_Time, main = "Marathon Finish Time")

skewness(marathon$Finish_Time)
[1] 0.3169402


Figure 2-4. Distribution of finish time of athletes in marathon data
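The two skewness coefficients discussed in this section can also be written by hand in a few lines of base R. The function names skew_g1 and skew_G1 and the sample vectors are ours, for illustration only; the standard deviation here uses the divide-by-N form.

```r
# Fisher-Pearson coefficient of skewness
skew_g1 <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sqrt(sum((x - m)^2) / n)      # standard deviation with N in the denominator
  (sum((x - m)^3) / n) / s^3
}

# Adjusted Fisher-Pearson coefficient (sample-size correction)
skew_G1 <- function(x) {
  n <- length(x)
  sqrt(n * (n - 1)) / (n - 2) * skew_g1(x)
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 10)  # made-up data with a long right tail
symmetric    <- c(1, 2, 3, 4, 5)

print(skew_g1(symmetric))       # 0 for perfectly symmetric data
print(skew_g1(right_tailed))    # positive
print(skew_g1(-right_tailed))   # mirror image: same size, negative sign
print(skew_G1(right_tailed))    # slightly larger in magnitude than g1
```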


2.3.2.3 Kurtosis


Kurtosis is a measure of the peakedness and tailedness of the probability distribution of a
random variable. Similar to skewness, kurtosis is also used to describe the shape of the
probability distribution function. In other words, kurtosis explains the variability due to a
few data points having extreme differences from the mean, rather than a lot of data points
having smaller differences from the mean. Higher values indicate a higher and sharper
peak and lower values indicate a lower and less distinct peak. Mathematically, kurtosis
is discussed in terms of the fourth moment around the mean. It’s easy to show that the
kurtosis of a normal distribution is 3. Because the normal distribution is the usual
reference, many people subtract 3 and use the following definition of kurtosis
(often called excess kurtosis):


$$\text{kurtosis} = \frac{\sum_{i=1}^{N} (x_i - \mu)^4 / N}{\sigma^4} - 3$$


Generally, there are three types of kurtosis:


• Mesokurtic: Distributions with a kurtosis value close to 3, which
means the previous formula evaluates to about 0. The typical example is a
normal distribution, such as the standard normal with mean 0 and standard deviation 1.


• Platykurtic: Distributions with a kurtosis value < 3. Comparatively,
a lower peak and shorter tails than the normal distribution.


• Leptokurtic: Distributions with a kurtosis value > 3.
Comparatively, a higher peak and longer tails than the normal
distribution.

While the kurtosis statistic is often used to numerically describe a sample,
it is said that “there seems to be no universal agreement about the meaning and
interpretation of kurtosis.” Tukey suggests that, like variance and skewness, kurtosis
should be viewed as a “vague concept” that can be formalized in a variety of ways.


# Three normal samples with different standard deviations. The labels follow the
# legend of Figure 2-5; note that every sample kurtosis below is close to 3.
# leptokurtic
set.seed(2)
random_numbers <- rnorm(20000, 0, 0.5)
plot(density(random_numbers), col = "blue", main = "Kurtosis Plots", lwd = 2.5,
     asp = 4)
kurtosis(random_numbers)

[1] 3.026302
# platykurtic
set.seed(900)
random_numbers <- rnorm(20000, 0, 0.6)
lines(density(random_numbers), col = "red", lwd = 2.5)
kurtosis(random_numbers)

[1] 2.951033
# mesokurtic
set.seed(3000)
random_numbers <- rnorm(20000, 0, 1)
lines(density(random_numbers), col = "green", lwd = 2.5)
kurtosis(random_numbers)

[1] 3.008717
legend(1, 0.7, c("leptokurtic", "platykurtic", "mesokurtic"),
       lty = c(1, 1, 1),
       lwd = c(2.5, 2.5, 2.5), col = c("blue", "red", "green"))
Figure 2-5. Showing kurtosis plots with simulated data

Compared with these kurtosis plots, the marathon finish time is platykurtic, with a
very low peak and short tails.


plot(density(as.numeric(marathon$Finish_Time)), col = "blue", main = "Kurtosis Plots",
     lwd = 2.5, asp = 4)


Figure 2-6. Showing kurtosis plot of finish time in marathon data


kurtosis(marathon$Finish_Time)
[1] 1.927956
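Changing the standard deviation of a normal sample changes the height of its peak but not its kurtosis, which stays near 3. To see values clearly below and above 3, the sketch below compares samples from different distribution families. The kurt() helper is ours and returns the kurtosis without subtracting 3; the choice of the uniform and exponential distributions is an assumption made for illustration.

```r
# Kurtosis as the normalized fourth central moment (no shift by 3)
kurt <- function(x) {
  m <- mean(x)
  n <- length(x)
  (sum((x - m)^4) / n) / (sum((x - m)^2) / n)^2
}

set.seed(42)
n <- 100000

k_normal_narrow <- kurt(rnorm(n, 0, 0.5))  # about 3
k_normal_wide   <- kurt(rnorm(n, 0, 1))    # about 3: the sd does not matter
k_uniform       <- kurt(runif(n))          # below 3 (theoretical value 1.8): platykurtic
k_exponential   <- kurt(rexp(n))           # above 3 (theoretical value 9): leptokurtic

print(c(normal_narrow = k_normal_narrow, normal_wide = k_normal_wide,
        uniform = k_uniform, exponential = k_exponential))
```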


2.4 Case Study: Credit Card Fraud 


In order to apply the concepts explained so far in this chapter, this section presents 
simulated data on credit card fraud. The data is approximately 200MB, which is big 
enough to explain most of the ideas discussed. Reference to this dataset will be made 
quite often throughout the book. So, if you have any thoughts of skipping this section, we 
strongly advise you not to do so. 


2.4.1 Data Import 


We will use the package data.table. It offers fast aggregation of large data (e.g., 100GB in 
RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, 
list columns, and a fast file reader (fread). Moreover, it has a natural and flexible syntax, for 
faster development. Let's start by looking at how this credit card fraud data looks. 


library(data.table)

data <- fread("ccFraud.csv", header = TRUE, verbose = FALSE, showProgress = FALSE)

str(data)

Classes ‘data.table’ and 'data.frame':  10000000 obs. of  9 variables:
 $ custID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender      : int  1 2 2 1 1 2 1 1 2 1 ...
 $ state       : int  35 2 2 15 46 44 3 10 32 23 ...
 $ cardholder  : int  1 1 1 1 1 2 1 1 1 1 ...
 $ balance     : int  3000 0 0 0 0 5546 2000 6016 2428 0 ...
 $ numTrans    : int  49 27 12 11 21 41 20 4 18...
 $ numIntlTrans: int  14090 16003 10 56...
 $ creditLine  : int  2 18 16571316225...
 $ fraudRisk   : int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr>


The str function displays the variables in the dataset with a few sample values. Following
are the nine variables:

• custID: A unique identifier for each customer

• gender: Gender of the customer

• state: State in the United States where the customer lives

• cardholder: Number of credit cards the customer holds

• balance: Balance on the credit card

• numTrans: Number of transactions to date

• numIntlTrans: Number of international transactions to date

• creditLine: The financial services corporation, such as Visa,
MasterCard, and American Express

• fraudRisk: Binary variable; 1 means the customer has been defrauded, 0
means otherwise


2.4.2 Data Transformation 


Further, it’s clear that variables like gender, state, and creditLine are mapped to
numeric identifiers. In order to understand the data better, we need to remap these
numbers back to their original meaning. We can do this using the merge function in R.
The file US State Code Mapping.csv contains the mapping for every U.S. state and the
numbers in the state variable in the dataset. Similarly, Gender Map.csv and credit line
map.csv contain the mapping for the variables gender and creditLine, respectively.

Mapping U.S. State


library(data.table)
US_state <- fread("US State Code Mapping.csv", header = TRUE, showProgress = FALSE)
data <- merge(data, US_state, by = 'state')


Mapping Gender


library(data.table)
Gender_map <- fread("Gender Map.csv", header = TRUE)
data <- merge(data, Gender_map, by = 'gender')


Mapping Credit Line


library(data.table)
Credit_line <- fread("credit line map.csv", header = TRUE)
data <- merge(data, Credit_line, by = 'creditLine')


Setting Variable Names and Displaying New Data 


setnames(data, "custID", "CustomerID")
setnames(data, "code", "Gender")
setnames(data, "numTrans", "DomesTransc")
setnames(data, "numIntlTrans", "IntTransc")
setnames(data, "fraudRisk", "FraudFlag")
setnames(data, "cardholder", "NumOfCards")
setnames(data, "balance", "OutsBal")
setnames(data, "StateName", "State")


str(data)
Classes ‘data.table’ and 'data.frame':  10000000 obs. of  11 variables:
 $ creditLine : int  1 1 1 1 1 1 1 1 1 1 ...
 $ CustomerID : int  4446 59161 136032 223734 240467 248899 262655 324670 390138 482698 ...
 $ NumOfCards : int  1 1 1 1 1 1 1 1 1 1 ...
 $ OutsBal    : int  2000 0 2000 2000 2000 0 0 689 2000 0 ...
 $ DomesTransc: int  31 25 78 11 40 47 15 17 48 25 ...
 $ IntTransc  : int  90300009035...
 $ FraudFlag  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ State      : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Gender     : chr  "Male" "Male" "Male" "Male" ...
 $ CardType   : chr  "American Express" "American Express" "American Express" "American Express" ...
 $ CardName   : chr  "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "creditLine"
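The remapping step relies on how merge treats keys that have no match. Here is a small base R sketch on made-up data; the code-to-label mapping and the unmapped code 3 are hypothetical. By default, rows without a match in the mapping table are dropped, and all.x = TRUE keeps them with a missing label instead.

```r
# Made-up customers; gender code 3 has no entry in the mapping table
customers  <- data.frame(custID = 1:5, gender = c(1, 2, 2, 1, 3))
gender_map <- data.frame(gender = c(1, 2), code = c("Male", "Female"))

# Default merge: only rows whose key appears in both tables are kept
inner <- merge(customers, gender_map, by = "gender")

# Keep every customer; unmatched keys get NA for the mapped column
left <- merge(customers, gender_map, by = "gender", all.x = TRUE)

print(inner)
print(left)
```

Comparing the row counts before and after a merge is a cheap way to notice codes that are missing from a mapping file.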


2.4.3 Data Exploration 


Since the data wasn't too dirty, we managed to skip most of the data-wrangling
steps described earlier in the chapter. However, in real-world problems,
the data transformation task is not so easy; it requires painstaking effort and data
engineering. We will use such approaches in later case studies in the book. In this case,
our data is ready to be explored in more detail. Let’s start the exploration.
summary(data[, c("NumOfCards", "OutsBal", "DomesTransc", "IntTransc"), with = FALSE])


   NumOfCards      OutsBal       DomesTransc       IntTransc
 Min.   :1.00   Min.   :    0   Min.   :  0.00   Min.   : 0.000
 1st Qu.:1.00   1st Qu.:    0   1st Qu.: 10.00   1st Qu.: 0.000
 Median :1.00   Median : 3706   Median : 19.00   Median : 0.000
 Mean   :1.03   Mean   : 4110   Mean   : 28.94   Mean   : 4.047
 3rd Qu.:1.00   3rd Qu.: 6000   3rd Qu.: 39.00   3rd Qu.: 4.000
 Max.   :2.00   Max.   :41485   Max.   :100.00   Max.   :60.000


So, if we want to understand the behavior of the number of transactions between 
men and women, it looks like there is no difference. Men and women shop equally, as 
shown in Figure 2-7. 


# DomesTransc + IntTransc is the total number of transactions (domestic plus international)
boxplot(I(DomesTransc + IntTransc) ~ Gender, data = data)
title("Number of Domestic Transaction")


Figure 2-7. The number of domestic transactions sorted by male and female


tapply(I(data$DomesTransc + data$IntTransc), data$Gender, median)
Female   Male
    24     24
tapply(I(data$DomesTransc + data$IntTransc), data$Gender, mean)
  Female     Male
32.97612 32.98624
Now, let’s look at the frequencies of the categorical variables.


The distribution of frauds across the card types is shown here. This type of frequency
table tells us which category is prominent among the fraud cases.


table(data$CardType, data$FraudFlag)


                         0       1
  American Express 2325707  149141
  Discover          598246   44285
  MasterCard       3843172  199532
  Visa             2636861  203056


You can see from the frequency table that the highest number of frauds occurred with Visa
cards, followed by MasterCard and American Express. The lowest number of frauds is reported
for Discover. The number of frauds defines the event rate for modeling purposes.
The event rate is the proportion of events (i.e., frauds) to the number of records in each
category.

Similarly, you can see the frequency tables for fraud by gender and fraud by state.


table(data$Gender, data$FraudFlag)


               0      1
  Female 3550933 270836
  Male   5853053 325178


More frauds are reported from males; the event rate of fraud in the male category
is 325178/(325178+5853053) = 5.3%. Similarly, the event rate in the female category is
270836/(270836+3550933) = 7.1%. Hence, while males have more frauds, the event rate
is higher for female customers. In both cases, the event rate is low, so we need to look at
sampling so that we get a higher event rate in the modeling dataset.
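The event rates can be computed directly from the counts in the gender table. The sketch below re-enters those counts by hand as a matrix, so it does not need the full dataset.

```r
# Counts copied from the gender-by-fraud table (columns: FraudFlag 0 and 1)
counts <- matrix(
  c(3550933, 5853053,   # FraudFlag = 0: Female, Male
    270836,  325178),   # FraudFlag = 1: Female, Male
  nrow = 2,
  dimnames = list(Gender = c("Female", "Male"), FraudFlag = c("0", "1"))
)

# Event rate = frauds divided by all records in the category
event_rate <- counts[, "1"] / rowSums(counts)
event_rate_pct <- round(100 * event_rate, 1)

print(event_rate_pct)  # Female 7.1, Male 5.3
```

On the full data, prop.table(table(data$Gender, data$FraudFlag), margin = 1) gives the same row-wise proportions in one step.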


2.5 Summary 


In upcoming chapters, we explain how to enrich this data to be able to model it and 
quantify these relationships for a predictive model. The next chapter will help you 
understand how you can reduce your dataset and at the same time enhance its properties 
to be able to apply machine learning algorithms. 

While it’s always good to say that more data implies a better model, there might be 
occasions where the luxury of sufficient amount of data is not there or computational 
power is limited to only allow a certain size of dataset. In such situations, statistics could 
help sample a precise and informative subset of data without compromising much on the 
quality of the model. Chapter 3 focuses on many such sampling techniques that will help 
in achieving this objective. 
CHAPTER 3


Sampling and Resampling 
Techniques 





In Chapter 2, we introduced the concept of data import and exploration techniques.
Now you are equipped to load data from different sources and store it
in an appropriate format. In this chapter we will discuss important data sampling
methodologies and their importance in machine learning algorithms. Sampling
is an important block in our machine learning process flow and it serves the dual
purpose of cost savings in data collection and reduction in computational cost without
compromising the power of the machine learning model.


“An approximate answer to the right problem is worth a good deal more 
than an exact answer to an approximate problem.” 
—John Tukey 


John Tukey's statement fits well into the spirit of sampling. Although technological
advancement has brought large data storage capabilities, the incremental cost of applying
machine learning techniques to all of that data is huge. Sampling helps us balance the cost of
processing high volumes of data against the marginal improvement in the results. Contrary
to the general belief that sampling is useful only for reducing a high volume of data to a
manageable volume, sampling is also important for improving statistics garnered from small
samples. In general, machine learning deals with huge volumes of data, but concepts like
bootstrap sampling can help you get insight from small sample situations as well.

The learning objectives of this chapter are as follows: 


• Introduction to sampling

• Sampling terminology

• Non-probability sampling and probability sampling

• Business implication of sampling

• Statistical theory on sample statistics

• Introduction to resampling

• Monte Carlo method: Acceptance-Rejection sampling

• Computational time saving illustration


Different sampling techniques will be illustrated using the credit card fraud data 
introduced in Chapter 2. 


3.1 Introduction to Sampling 


Sampling is a process that selects units from a population of interest in such a way that
the sample can be generalized to the population with statistical confidence. For instance,
if an online retailer wanted to know the average ticket size of an online purchase over the
last 12 months, it might not want to average the ticket size over the whole population (which
may run into millions of data points for big retailers). Instead, it can pick a representative
sample of purchases over the last 12 months and estimate the average from the sample.
The sample average can then be generalized to the population with some statistical
confidence. The statistical confidence will vary based on the sampling technique used
and the sample size.
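The retailer example can be sketched with simulated data. Everything here is an assumption made for illustration: the population of ticket sizes is drawn from an exponential distribution with mean 50, the sample size is 10,000, and the interval uses the usual normal approximation.

```r
set.seed(1)

# Simulated population of one million ticket sizes (hypothetical)
population <- rexp(1e6, rate = 1 / 50)
pop_mean <- mean(population)

# Simple random sample without replacement
s <- sample(population, 10000)
sample_mean <- mean(s)

# Standard error of the sample mean and an approximate 95% confidence interval
se <- sd(s) / sqrt(length(s))
ci <- sample_mean + c(-1, 1) * qnorm(0.975) * se

print(c(population = pop_mean, sample = sample_mean))
print(ci)
```

The sample mean lands close to the population mean while touching only 1% of the records, and the width of the interval shrinks as the sample size grows.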

In general, sampling techniques are applied in two scenarios: creating a
manageable dataset for modeling and summarizing population statistics. This broad
categorization can be presented as the objectives of sampling:


• Model sampling
• Survey sampling


Model sampling is done when the population data is already collected and you want
to reduce the time and computational cost of analysis, along with improving the inference
of your models. Another approach is to create a sample design and then survey the
population only to collect the sample, to save on data collection costs. Figure 3-1 shows the two
business objectives of sampling. Sample survey design and evaluation are out of the
scope of this book, so we will keep our focus on model sampling alone.

Model Sampling
  Population: Data has already been collected from the population.
  Sampling: The sampling is done to reduce computational cost and improve performance.
  Application: Financial modeling, product development, risk assessment, etc.

Survey Sampling
  Population: We know the population from which to collect data.
  Sampling: Sampling is done to reduce data collection cost and to have targeted data for hypothesis testing.
  Application: Marketing, macroeconomics, clinical trials, etc.

Figure 3-1. Objectives of sampling


This classification is also helpful in identifying the end objectives of sampling. 
This helps in choosing the right methodology for sampling and the right exploratory 
technique. In the context of the machine learning model building flow, our focus will be 
around model sampling. The assumption is that the data has already been collected and 
our end objective is to garner insight from that data, rather than develop a systematic 
survey to collect it. 


3.2 Sampling Terminology 


Before we get into the details of sampling, let's define some basic sampling terminology
that we will be using throughout the book. The statistics and probability concepts
discussed in Chapter 1 will come in handy in understanding this terminology. This
section lists the definitions and mathematical formulations used in sampling.


3.2.1 Sample 


A sample is a set of units or individuals selected from a parent population to provide
some useful information about the population. This information can be the general
shape of the distribution, basic statistics, properties of the population distribution parameters,
or information about some higher moments. Additionally, the sample can be used to
estimate test statistics for hypothesis testing. A representative sample can be used to
estimate properties of the population or to model population parameters.

For instance, the National Sample Survey Organization (NSSO) collects sample data
on unemployment by reaching out to a limited number of households, and this sample is then used to
provide data on national unemployment.


3.2.2 Sampling Distribution 


The distribution of the means of a particular size of samples is called the sampling 
distribution of means; similarly the distribution of the corresponding sample variances is 
called the sampling distribution of the variances. These distributions are the fundamental 
requirements for performing any kind of hypothesis testing. 
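
A sampling distribution can be approximated by simulation. The sketch below is an illustration only: the population is simulated from an exponential distribution, and the sample size of 100 and the 2,000 repetitions are arbitrary choices.

```r
set.seed(202)
# Hypothetical skewed population of 100,000 values
population <- rexp(100000, rate = 1 / 50)

sample_size <- 100    # size of each sample
n_samples   <- 2000   # number of repeated samples

# For each repetition, draw one sample and record its mean and variance
stats <- replicate(n_samples, {
  s <- sample(population, sample_size)
  c(mean = mean(s), variance = var(s))
})

sample_means <- stats["mean", ]      # sampling distribution of the mean
sample_vars  <- stats["variance", ]  # sampling distribution of the variance

cat("Population mean          :", mean(population), "\n")
cat("Mean of sample means     :", mean(sample_means), "\n")
cat("SD of sample means       :", sd(sample_means), "\n")
cat("Population SD / sqrt(n)  :", sd(population) / sqrt(sample_size), "\n")
```

Calling hist(sample_means) or hist(sample_vars) afterward shows the shape of each sampling distribution.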


3.2.3 Population Mean and Variance 


Population mean is the arithmetic average of the population data. All the data points 
contribute toward the population mean with equal weight. Similarly, population variance 
is the variance calculated using all the data points in the data. 


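Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

where N is the number of data points in the population.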
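
As a small worked example, the population mean and variance can be computed directly in R. The six values below are made up for illustration and are treated as a complete population. Note that R's built-in var() divides by n - 1 (the sample variance discussed in Section 3.2.4), so the population variance has to be computed explicitly or rescaled.

```r
# Small illustrative dataset, treated here as a complete population
x <- c(4, 8, 15, 16, 23, 42)
N <- length(x)

population_mean     <- sum(x) / N
population_variance <- sum((x - population_mean)^2) / N

# var() uses the n - 1 denominator, so rescale it to recover the N version
variance_from_var <- var(x) * (N - 1) / N

cat("Population mean     :", population_mean, "\n")
cat("Population variance :", population_variance, "\n")
cat("var(x)              :", var(x), "\n")
```

For large datasets the two denominators give nearly identical results, but for a handful of points, as here, the difference is visible.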


3.2.4 Sample Mean and Variance 


Any subset you draw from the population is a sample. The mean and variance obtained
from that sample are called sample statistics. The concept of degrees of freedom is used
when a sample is used to estimate distribution parameters; hence, you will see that for the sample
variance the denominator is different from that of the population variance.

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$


3.2.5 Pooled Mean and Variance


For k samples of sizes $n_1, n_2, n_3, \ldots, n_k$ taken from the same population, with sample
means $\bar{x}_i$ and sample variances $s_i^2$, the estimated population mean and variance are
defined as follows.

Estimated population mean: $\bar{x}_p = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}$

Estimated population variance: $s_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$

In real-life situations, we can usually have multiple samples drawn from the same
population at different locations and at different points in time. For example, assume we have
to estimate the average income of bookshop owners in a city. We will get samples of bookshop
owners' income from different parts of the city at different points in time. At a later point in
time, we can combine the individual means and variances from the different samples to get an
estimate for the population using the pooled mean and variance.


3.2.6 Sample Point 


A possible outcome in a sampling experiment is called a sample point. In many types of
sampling, not all the data points in the population are sample points.

Sample points are important when the sampling design becomes complex. The
researcher may want to leave some observations out of sampling; alternatively, the
sampling process can by design give a lower probability of selection to the undesired
data points. For example, suppose you have gender data with three possible values: Male,
Female, and Unknown. You may want to discard all Unknowns as errors, which keeps
those observations out of sampling. Otherwise, if the data is large and the Unknowns are a
very small proportion, the probability of sampling them is negligible. In both cases,
Unknown is not a sample point.


3.2.7 Sampling Error 


The difference between the sample statistic and the true value of the population statistic is
the sampling error. This error arises because the estimate has been obtained
from a sample rather than from the whole population.

For example, suppose you know from census data that the monthly average income of
residents in Boston is $3,000 (the population mean). So, we can say that the true mean
income is $3,000. Let's say that a market research firm performs a small survey of
residents in Boston and finds that the sample average income from this small survey
is $3,500. The sampling error is then $3,500 - $3,000, which equals $500. The sample
overestimates the average income, which suggests that the
sample is not a true representation of the population.


3.2.8 Sampling Fraction


The sampling fraction is the ratio of sample size to population size, $f = \frac{n}{N}$.

For example, if your total population size is 50,000 and you want to draw a sample
of 2,000 from the population, the sampling fraction would be f = 2,000/50,000 = 0.04. In
other words, 4% of the population is sampled.
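
The arithmetic behind the last two definitions is simple enough to check in R. The figures are the ones used in the examples above; the variable names are illustrative.

```r
# Sampling error: sample statistic minus the true population value
true_mean      <- 3000   # census average monthly income
sample_mean    <- 3500   # survey average monthly income
sampling_error <- sample_mean - true_mean

# Sampling fraction: sample size divided by population size
sample_size       <- 2000
population_size   <- 50000
sampling_fraction <- sample_size / population_size

cat("Sampling error    :", sampling_error, "\n")
cat("Sampling fraction :", sampling_fraction, "\n")
```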


3.2.9 Sampling Bias 


Sampling bias occurs when the sample units from the population are not characteristic of
(i.e., do not reflect) the population. Sampling bias causes a sample to be unrepresentative
of the population.

Connecting back to the example from sampling error, we found that the sample
average income is much higher than the census average income (the true average). This suggests
that our sampling design has been biased toward higher-income residents of Boston. In that
case, our sample is not a true representation of Boston residents.


3.2.10 Sampling Without Replacement (SWOR) 


Sampling without replacement requires two conditions to be satisfied:


• Each unit/sample point has a finite non-zero probability of selection
• Once a unit is selected, it is removed from the population


In other words, all the units have some finite probability of being sampled strictly
only once.

For instance, if we have a bag of 10 balls, marked with numbers 1 to 10, then each ball
has a selection probability of 1/10 on the first draw of a random sample done without replacement. Suppose
we have to choose three balls from the bag. After each selection, the probability
of selecting any remaining ball increases as the number of balls left in the bag decreases. So, for the first draw the
probability of a given ball being selected is 1/10, for the second it's 1/9, and for the third it's 1/8.
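
The bag-of-balls example can be reproduced with R's sample() function. This is a minimal sketch; the seed and the 20,000 repetitions used for the simulation are arbitrary choices.

```r
set.seed(42)
balls <- 1:10

# Draw three balls without replacement: no ball can appear twice
draw <- sample(balls, size = 3, replace = FALSE)
print(draw)

# Probability of picking any one specific remaining ball at draws 1, 2, and 3
balls_left <- length(balls) - 0:2
draw_probs <- 1 / balls_left
print(draw_probs)

# Overall chance that a given ball (here ball 1) ends up in the sample of three
inclusion_rate <- mean(replicate(20000, 1 %in% sample(balls, size = 3)))
print(inclusion_rate)
```

Although the per-draw probabilities change, every ball has the same overall chance of ending up in the sample (3/10 here), which the simulated inclusion rate should approximate.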


3.2.11 Sampling with Replacement (SWR) 


Sampling with replacement differs from SWOR in that a unit can be sampled
more than once in the same sample. Sampling with replacement requires two conditions
to be satisfied:


• Each unit/sample point has a finite non-zero probability of selection
• A unit can be selected multiple times, as the sampling population is always the same


In sampling with replacement, a unit can be sampled more than once and has
the same probability of being sampled each time. This type of sampling virtually expands
the size of the population to infinity, as you can create as many samples of any size as you like with this
method. Connecting back to our previous example in SWOR, if we have to choose three
balls with SWR, each ball will have the exact same finite probability of 1/10 on every draw.

The important thing to note here is that sampling with replacement technically makes
the population size infinite. Be careful when choosing SWR, as in most cases each
observation is unique and counting it multiple times creates bias in your data. Essentially,
it means you are allowing repetition of observations. For example, 100 people having
the same name, income, age, and gender in the sample will create bias in the dataset.
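
The contrast with SWOR shows up clearly when the requested sample is larger than the population. The following sketch, which again uses the ten numbered balls and an arbitrary seed, draws 15 balls with replacement and then shows that the same request fails without replacement.

```r
set.seed(7)
balls <- 1:10

# With replacement, the sample can be larger than the population,
# so some balls must repeat
swr_draw <- sample(balls, size = 15, replace = TRUE)
print(table(swr_draw))

# Without replacement, the same request raises an error, which is caught here
swor_result <- tryCatch(
  sample(balls, size = 15, replace = FALSE),
  error = function(e) "error: sample larger than population"
)
print(swor_result)
```

The repeated values in swr_draw are exactly the duplicated observations the text warns about.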


3.3 Credit Card Fraud: Population Statistics 


The credit card fraud dataset is a good example of how to build a sampling plan for machine
learning algorithms. The dataset is huge, with 10 million rows and multiple features. This
section shows how the key sampling measures of the population can be calculated and
interpreted for this dataset. The following statistical measures will be shown:


• Population mean
• Population variance
• Pooled mean and variance


To explain these measures, we chose the outstanding balance feature as the quantity 
of interest. 


3.3.1 Data Description 


Here is a quick recap from Chapter 2 of the variables in the credit card
fraud data:


• custID: A unique identifier for each customer
• gender: Gender of the customer
• state: State in the United States where the customer lives
• cardholder: Number of credit cards the customer holds
• balance: Balance on the credit card
• numTrans: Number of transactions to date
• numIntlTrans: Number of international transactions to date
• creditLine: The financial services corporation, such as Visa, MasterCard, or American Express
• fraudRisk: Binary variable; 1 means the customer has been defrauded, 0 means otherwise


str(data)

Classes 'data.table' and 'data.frame': 10000000 obs. of 14 variables:
 $ creditLine : int 1 1 1 1 1 1 1 1 1 1 ...
 $ gender     : int 1 1 1 1 1 1 1 1 1 1 ...
 $ state      : int 1 1 1 1 1 1 1 1 1 1 ...
 $ CustomerID : int 4446 59161 136032 223734 240467 248899 262655 324670 390138 482698 ...
 $ NumOfCards : int 1 1 1 1 1 1 1 1 1 1 ...
 $ OutsBal    : int 2000 0 2000 2000 2000 0 0 689 2000 0 ...
 $ DomesTransc: int 31 25 78 11 40 47 15 17 48 25 ...
 $ IntTransc  : int 90300009035...
 $ FraudFlag  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ State      : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ PostalCode : chr "AL" "AL" "AL" "AL" ...
 $ Gender     : chr "Male" "Male" "Male" "Male" ...
 $ CardType   : chr "American Express" "American Express" "American Express" "American Express" ...
 $ CardName   : chr "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "creditLine"


As stated earlier, we chose outstanding balance as the variable/feature of interest.
In the str() output describing the data, we can see that the outstanding balance is stored in
a variable named OutsBal, which is of type integer. Because it is a continuous variable, the mean
and variance can be defined for it.


3.3.2 Population Mean 


Mean is a more traditional statistic for measuring the central tendency of any distribution 
of data. The mean outstanding balance of our customers in the credit card fraud dataset 
turns out to be $4109.92. This is our first understanding about the population. Population 
mean tells us that on average, the customers have an outstanding balance of $4109.92 on 
their cards. 


Population_Mean_P <- mean(data$OutsBal)
cat("The average outstanding balance on cards is ", Population_Mean_P)
The average outstanding balance on cards is 4109.92


3.3.3 Population Variance 


Variance is a measure of spread for a given set of numbers. The smaller the variance,
the closer the numbers are to the mean; the larger the variance, the farther away
the numbers are from the mean. For the outstanding balance, the variance is 15974788
and the standard deviation is 3996.8. The variance by itself is not comparable across
different populations or samples; it needs to be seen along with the mean of the
distribution. The standard deviation is another measure of spread, and it equals the square root of the
variance.


Population_Variance_P <- var(data$OutsBal)
cat("The variance in the average outstanding balance is ", Population_Variance_P)
The variance in the average outstanding balance is 15974788

cat("Standard deviation of outstanding balance is", sqrt(Population_Variance_P))
Standard deviation of outstanding balance is 3996.847


3.3.4 Pooled Mean and Variance 


Pooled mean and variance estimate population mean and variance when multiple 
samples are drawn independently of each other. To illustrate the pooled mean and 
variance compared to true population mean and variance, we will first create five random 
samples of size 10K, 20K, 40K, 80K, and 100K and calculate their mean and variance. 

Using these samples, we will estimate the population mean and variance by using 
pooled mean and variance formula. Pooled values are useful because estimates from a 
single sample might produce a large sampling error, whereas if we draw many samples 
from the same population, the sampling error is reduced. The estimate in a collective 
manner will be closer to the true population statistics. 


Note The sampling fraction is low for all the sample sizes (for the 100K sample,
f = 100000/10000000 = 1/100), but the samples themselves are large. With this many
observations, the degrees-of-freedom correction of 1 in the denominator has a negligible
effect on the variance, so we can safely use the var() function in R for the sample variance.


In the following R snippet, we create five random samples using the sample()
function. sample() is a built-in function that's used multiple times in this book.
Note that sample() relies on a random seed,
so if you want to create reproducible code, call the set.seed(937) function first. This will
make sure that each time you run the code, you get the same random sample.


set.seed(937)
i<-1
n<-rbind(10000,20000,40000,80000,100000)
Sampling_Fraction<-n/nrow(data)
sample_mean<-numeric()
sample_variance<-numeric()
for(i in 1:5)
{
    # Draw a simple random sample of n[i] rows without replacement
    sample_100K <-data[sample(nrow(data),size=n[i], replace =FALSE, prob =NULL),]
    sample_mean[i]<-round(mean(sample_100K$OutsBal),2)
    sample_variance[i] <-round(var(sample_100K$OutsBal),2)
}

Sample_statistics <-cbind(1:5,c('10K','20K','40K','80K','100K'),sample_mean,sample_variance,round(sqrt(sample_variance),2),Sampling_Fraction)

knitr::kable(Sample_statistics, col.names =c("S.No.", "Size", "Sample_Mean", "Sample_Variance", "Sample SD", "Sample_Fraction"))
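
The effect of set.seed() can be checked directly. The following minimal sketch (the vector 1:100 and the variable names are illustrative and are not part of the credit card data) resets the seed before each draw and confirms that the two samples match, while a draw made without resetting the seed generally differs.

```r
# Same seed, same sample
set.seed(937)
first_draw <- sample(1:100, size = 5)

set.seed(937)
second_draw <- sample(1:100, size = 5)

# No reset here, so the random number stream simply continues
third_draw <- sample(1:100, size = 5)

print(first_draw)
print(identical(first_draw, second_draw))
print(third_draw)
```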


In Table 3-1, the basic properties of the five samples are presented. The highest sampling
fraction is for the biggest sample size. Notice that the sample variance stays close to the
population variance of 15974788 at every sample size; what shrinks as the sample size
increases is the variability of the sample mean around the population mean, not the
sample variance itself.


Table 3-1. Sample Statistics

S.No.  Size  Sample Mean  Sample Variance  Sample SD  Sample Fraction
1      10K   4092.48      15921586.32      3990.19    0.001
2      20K   4144.26      16005696.09      4000.71    0.002
3      40K   4092.28      15765897.18      3970.63    0.004
4      80K   4127.18      15897698.44      3987.19    0.008
5      100K  4095.28      15841598.06      3980.15    0.01


Now let's use the pooled mean and variance formulas to estimate the population
mean and variance from the five samples we drew from the population and then compare them with the
true population mean and variance.


i<-1
Population_mean_Num<-0
Population_mean_Den<-0
for(i in 1:5)
{
    # Weight each sample mean by its sample size
    Population_mean_Num =Population_mean_Num +sample_mean[i]*n[i]
    Population_mean_Den =Population_mean_Den +n[i]
}

Population_Mean_S<-Population_mean_Num/Population_mean_Den

cat("The pooled mean ( estimate of population mean) is",Population_Mean_S)
The pooled mean ( estimate of population mean) is 4108.814


The pooled mean is $4,108.814. Now we apply this same process to calculate the 
pooled variance from the samples. Additionally, we will show the standard deviation as 
an extra column to make dispersion comparable to the mean measure. 


i<-1
Population_variance_Num<-0
Population_variance_Den<-0
for(i in 1:5)
{
    # Weight each sample variance by its degrees of freedom (n - 1)
    Population_variance_Num =Population_variance_Num +(sample_variance[i])*(n[i] -1)
    Population_variance_Den =Population_variance_Den +n[i] -1
}

Population_Variance_S<-Population_variance_Num/Population_variance_Den
Population_SD_S<-sqrt(Population_Variance_S)

cat("The pooled variance (estimate of population variance) is", Population_Variance_S)
The pooled variance (estimate of population variance) is 15863765

cat("The pooled standard deviation (estimate of population standard deviation) is", sqrt(Population_Variance_S))
The pooled standard deviation (estimate of population standard deviation) is 3982.934
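
The two loops can also be written in vectorized form with R's built-in weighted.mean(). The sketch below is an alternative formulation, not the code used above; it re-enters the rounded sample statistics from Table 3-1 so that it runs without the full dataset, and the variable names pooled_mean, pooled_variance, and pooled_sd are illustrative.

```r
# Sample sizes, means, and variances as reported in Table 3-1
n               <- c(10000, 20000, 40000, 80000, 100000)
sample_mean     <- c(4092.48, 4144.26, 4092.28, 4127.18, 4095.28)
sample_variance <- c(15921586.32, 16005696.09, 15765897.18,
                     15897698.44, 15841598.06)

# Pooled mean: sample means weighted by sample size
pooled_mean <- weighted.mean(sample_mean, w = n)

# Pooled variance: sample variances weighted by degrees of freedom (n - 1)
pooled_variance <- weighted.mean(sample_variance, w = n - 1)
pooled_sd       <- sqrt(pooled_variance)

cat("Pooled mean     :", pooled_mean, "\n")
cat("Pooled variance :", pooled_variance, "\n")
cat("Pooled SD       :", pooled_sd, "\n")
```

The results agree with the loop output (4108.814, 15863765, and 3982.934) up to rounding.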


The pooled standard deviation is $3,982.934. Now we have both pooled statistics and 
population statistics. Here, we create a comparison between the two and see how well the 
pooled statistics estimated the population statistics: 


SamplingError_percent_mean<-round((Population_Mean_P -sample_mean)/Population_Mean_P,3)

SamplingError_percent_variance<-round((Population_Variance_P -sample_variance)/Population_Variance_P,3)

Com_Table_1<-cbind(1:5,c('10K','20K','40K','80K','100K'),Sampling_Fraction,SamplingError_percent_mean,SamplingError_percent_variance)

knitr::kable(Com_Table_1, col.names =c("S.No.", "Size", "Sampling_Frac", "Sampling_Error_Mean(%)", "Sampling_Error_Variance(%)"))


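Table 3-2 shows the comparison of the population mean and variance against
each individual sample. In general, the bigger the sample, the closer the estimate tends to be to the true
population value, although with samples this large all five estimates are already within
about 1% of the population values. Note that, despite the (%) in the column names, the
errors are reported as fractions of the population value (0.004 corresponds to 0.4%).


Table 3-2. Sample Versus Population Statistics

S.No.  Size  Sampling_Frac  Sampling_Error_Mean(%)  Sampling_Error_Variance(%)
1      10K   0.001           0.004                   0.003
2      20K   0.002          -0.008                  -0.002
3      40K   0.004           0.004                   0.013
4      80K   0.008          -0.004                   0.005
5      100K  0.01            0.004                   0.008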




Next, create the same view for the pooled statistics against the population statistics. The
difference between each pooled statistic and the true population statistic is expressed
as a fraction of the true population statistic.


SamplingError_percent_mean<-(Population_Mean_P -Population_Mean_S)/Population_Mean_P

SamplingError_percent_variance<-(Population_Variance_P -Population_Variance_S)/Population_Variance_P

Com_Table_2 <-cbind(Population_Mean_P,Population_Mean_S,SamplingError_percent_mean)

Com_Table_3 <-cbind(Population_Variance_P,Population_Variance_S,SamplingError_percent_variance)

knitr::kable(Com_Table_2)

knitr::kable(Com_Table_3)


Table 3-3. Population Mean and Sample Mean Difference

Population_Mean_P  Population_Mean_S  SamplingError_percent_mean
4109.92            4108.814           0.000269


Table 3-4. Population Variance and Sample Variance

Population_Variance_P  Population_Variance_S  SamplingError_percent_variance
15974788               15863765               0.0069499


The pooled mean is close to the true mean of the population. This shows that, given
multiple samples, the pooled statistics are likely to capture the true values of the population statistics. You
have now seen how a sample that is very small compared to the population gives you
estimates very close to the population values.
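
One way to see why such small samples work is the standard error of the mean, which is approximately the population standard deviation divided by the square root of the sample size. This formula is a standard statistical result rather than something derived in this chapter, and the sketch below ignores the finite population correction. It uses the population standard deviation of 3996.847 computed earlier and the five sample sizes from Table 3-1.

```r
population_sd <- 3996.847   # population standard deviation of OutsBal
n <- c(10000, 20000, 40000, 80000, 100000)

# Approximate standard error of the sample mean for each sample size
standard_error <- population_sd / sqrt(n)

print(data.frame(sample_size = n, standard_error = round(standard_error, 2)))
```

Even the 10K sample has a standard error of only about $40 on a mean of roughly $4,110, and a tenfold increase in sample size reduces it only by a factor of about 3.16. That diminishing return is the cost-benefit trade-off behind sampling.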

Does this example give you a tool for dealing with big data by using small samples
drawn from it? By now you might have started thinking about the cost-benefit analysis of
using sampling. This is very relevant to machine learning algorithms churning through millions of
data points. More data points do not necessarily mean that all of them contain meaningful
patterns, trends, or information. Sampling helps you avoid the weeds and
focus on meaningful datasets for machine learning.


3.4 Business Implications of Sampling 


Sampling is applied at multiple stages of model development and decision making.
Sampling methods and their interpretation are driven by business constraints and by the statistical
methods chosen for inference testing. Data scientists have to strike a delicate balance
between the business implications and keeping the statistical results valid and relevant.
Most of the time, the problem is posed by the business, and data scientists have to work in a
targeted manner to solve it.

For instance, suppose the business wants to know why customers are not returning
to its web site. This problem will dictate the terms of sampling. To know why customers
are not coming back, do you really need a representative sample of the whole population of
your customers? Or will you just take a sample of customers who didn't return? Or would
you rather study only a sample of customers who return and negate the results?
Why not create a custom mixed bag of returning and non-returning customers? As
you can see, a lot of these questions, along with practical limitations on time, cost,
computational capacity, etc., will be deciding factors in how to go about gathering data
for this problem.

In general, the following are the salient features of sampling and some
shortcomings that need to be kept in mind while using sampling in your machine
learning model building flow.


3.4.1 Features of Sampling 

• Scientific in nature
• Optimizes time and space constraints
• Reliable method of hypothesis testing
• Allows in-depth analysis by reducing cost
• In cases where the population is very large and infrastructure is a constraint, sampling is the only way forward


3.4.2 Shortcomings of Sampling 

• Sampling bias can cause wrong inference
• Representative sampling is not always possible due to size, type, requirements, etc.
• It is not an exact science but an approximation within certain confidence limits
• In sample surveys, we have issues of manual errors, inadequate response, absence of informants, etc.


3.5 Probability and Non-Probability Sampling 


The sampling methodology largely depends on what we want to do with the sample:
whether we want to generate a hypothesis about population parameters or to test a
hypothesis. We classify sampling methods into two major buckets: probability sampling
and non-probability sampling. The comparison in Figure 3-2 provides the high-level
differences between them.

Probability (Random) Sampling | Non-Probability (Non-Random) Sampling
Can be generalized to the population defined by the sampling frame | Cannot be generalized beyond the sample
Can apply statistical methods, hypothesis testing, confidence bounds | Exploratory research; helps in generating hypotheses and analytical inference
Can estimate population statistics/parameters from the sample | Population statistics/parameters are not of interest
Reduces bias by varying sample design | Biased, Sample be known
Random selection from population | No defined population; cheaper, easier, and convenient to carry out

Figure 3-2. Probability versus non-probability sampling





In probability sampling, the sampling method draws each unit with some finite
probability. The sampling frame that maps population units to sample units is created
based on the probability distribution of the random variable utilized for sampling. These
types of methods are commonly used for model sampling and are highly reliable
for drawing population inferences. They reduce bias in parameter estimation, and their results can
be generalized to the population. In contrast to non-probability sampling, we need to
know the population beforehand in order to sample from it. This makes the method costly and
sometimes difficult to implement.

Non-probability sampling is sampling based on the subjective judgment of experts
and on business requirements. It is a popular method when the business needs
do not have to align with statistical requirements or when it is difficult to create a probability
sampling frame. The non-probability sampling method does not assign probabilities to
population units, and hence it becomes highly unreliable to draw inferences from the
sample. Non-probability samples are biased toward the selected classes, as the sample
is not representative of the population. Non-probability methods are more popular in
exploratory research into new traits of a population that can be tested later with more
statistical rigor. In contrast to probability techniques, it is not possible to estimate
population parameters accurately using non-probability techniques.


3.5.1 Types of Non-Probability Sampling 


In this section, we briefly touch upon the three major types of non-probability sampling 
methods. As these techniques are more suited for survey samples, we will not discuss 
them in detail. 


3.5.1.1 Convenience Sampling 


In convenience sampling, the expert chooses the data that is easily available. This
technique is the cheapest and the least time-consuming. For our case, suppose that the data
from New York is accessible, but for other states the data is not readily accessible, so
we choose data from one state to study the whole United States. The sample would not
be a representative sample of the population and would be biased. The insights also cannot be
generalized to the entire population. However, the sample might allow us to create some
hypotheses that can later be tested using random samples from all the states.
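
The bias of a convenience sample can be illustrated with simulated data. Everything in this sketch is hypothetical: the three states, their average balances, and the sample size of 1,000 are assumptions chosen only to make the effect visible.

```r
set.seed(11)
# Hypothetical population: three states with different average balances
population <- data.frame(
  state   = rep(c("New York", "Texas", "Ohio"), each = 10000),
  balance = c(rnorm(10000, mean = 6000, sd = 1500),
              rnorm(10000, mean = 4000, sd = 1500),
              rnorm(10000, mean = 3000, sd = 1500))
)
true_mean <- mean(population$balance)

# Convenience sample: the first 1,000 records from the one accessible state
convenience      <- population[population$state == "New York", ]
convenience_mean <- mean(convenience$balance[1:1000])

# Simple random sample of the same size drawn from all states
random_rows <- sample(nrow(population), size = 1000)
random_mean <- mean(population$balance[random_rows])

cat("True mean               :", true_mean, "\n")
cat("Convenience sample mean :", convenience_mean, "\n")
cat("Random sample mean      :", random_mean, "\n")
```

The convenience estimate reflects only the accessible state, while the random sample lands much closer to the true mean.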


3.5.1.2 Purposive Sampling 


When the sampling is driven by the subjective judgment of the expert, it's called purposive
sampling. In this method, the expert samples those units that help him establish the
hypothesis he is trying to test. For our case, if the researcher is only interested in looking
at American Express cards, he will simply choose some units from the pool of records
for that card type. Further, there are many types of purposive sampling methods, e.g.,
maximum variance sampling, extreme case sampling, homogeneous sampling, etc., but
these are not discussed in this book, as they lack the representativeness of the population that is
required for unbiased machine learning methods.


3.5.1.3 Quota Sampling 


As the name suggests, quota sampling is based on a prefixed quota for each type of case,
with the quota usually decided by an expert. Fixing a quota grid for sampling ensures equal
or proportionate representation of the subjects being sampled. This technique is popular in
marketing campaign design, A/B testing, and new feature testing.
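
A quota grid can be sketched in a few lines of R. The card records below are simulated, and the quota of 50 per card type and the rule of taking the first records encountered are assumptions made for illustration. The point is that selection within each class is not random.

```r
set.seed(12)
# Hypothetical card records with unequal shares of each card type
cards <- data.frame(
  id        = 1:3000,
  card_type = sample(c("American Express", "Visa", "MasterCard"),
                     size = 3000, replace = TRUE, prob = c(0.2, 0.5, 0.3))
)

# Quota grid decided by the expert: 50 records of each card type
quota <- c("American Express" = 50, "Visa" = 50, "MasterCard" = 50)

# Fill each quota with the first records encountered for that type
quota_sample <- do.call(rbind, lapply(names(quota), function(type) {
  head(cards[cards$card_type == type, ], quota[[type]])
}))

print(table(quota_sample$card_type))
```

Each card type is equally represented in the result even though the types are not equally common in the simulated population.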

In this chapter we cover sampling methods with examples drawn from our credit
card fraud data. We encourage you to explore non-probability sampling further in the context
of the business problem at hand. There are times when experience can beat
statistics, so non-probability sampling is equally important in many use cases.


3.6 Statistical Theory on Sampling Distributions 


Sampling techniques draw their validity from well-established theorems and time-tested 
methods from statistics. For studying sampling distribution, we need to understand two 
important theorems from statistics: 


• Law of Large Numbers
• Central Limit Theorem


This section explains these two theorems with some simulations. 


3.6.1 Law of Large Numbers: LLN 


In general, as the sample size increases in a test, we expect the results to be more
accurate, with smaller deviations from the expected outcomes. The law of large numbers
formalizes this with the help of probability theory. The first notable reference to this
concept was given by the Italian mathematician Gerolamo Cardano in the 16th century,
when he observed and stated that empirical statistics get closer to their true value as the
number of trials increases.

In later years, a lot of work was done to obtain different forms of the Law of Large
Numbers. The coin toss example we are going to discuss was first studied by
Bernoulli, who later provided a proof of his observations. Aleksandr Khinchin provided
the most popular statement of the Law of Large Numbers, also called the weak law of
large numbers. The weak law of large numbers is alternatively called the law of averages.


3.6.1.1 Weak Law of Large Numbers 


In probability space, the sample average converges to the expected value as the sample
size tends to infinity. In other words, as the number of trials or the sample size grows, the
probability of getting close to the true average increases. The weak law is also called
Khinchin's Law to recognize his contribution.

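The weak law of large numbers states that the sample averages converge in
probability toward the expected value, $\bar{X}_n \xrightarrow{P} \mu$ when $n \to \infty$.


Alternatively, for any positive number $\varepsilon$,


$\lim_{n \to \infty} \Pr\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0.$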
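
The limit statement can be checked empirically. The sketch below is a simulation under assumed values (p = 0.6, epsilon = 0.05, and 2,000 repetitions per sample size, all chosen only for illustration): for each sample size it estimates the probability that the sample average of Bernoulli trials falls more than epsilon away from the true mean.

```r
set.seed(13)
p            <- 0.6    # true mean of each Bernoulli trial
epsilon      <- 0.05   # tolerance around the true mean
sample_sizes <- c(10, 100, 1000, 10000)
n_reps       <- 2000   # repeated experiments per sample size

# Estimated Pr(|sample average - p| > epsilon) for each sample size
prob_far <- sapply(sample_sizes, function(n) {
  sample_means <- rbinom(n_reps, size = n, prob = p) / n
  mean(abs(sample_means - p) > epsilon)
})

print(data.frame(n = sample_sizes, prob_far = prob_far))
```

The estimated probability drops toward zero as n grows, which is what the weak law asserts.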





3.6.1.2 Strong Law of Large Numbers 


It is important to understand the subtle difference between the weak and strong laws
of large numbers. The strong law of large numbers states that the sample average will
converge to the true average with probability 1 (almost surely), while the weak law states only that it
converges in probability. Hence, the strong law is the more powerful statement when estimating the population
mean by sample means.

The strong law of large numbers states that the sample average converges almost
surely to the expected value, $\bar{X}_n \xrightarrow{a.s.} \mu$ when $n \to \infty$.


Equivalent to


$\Pr\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.$


Note There are multiple representations and proofs for the Law of Large Numbers. You 
are encouraged to refer to any graduate level text of probability to learn more. 


Without getting into the statistical details of this theorem, we will set up an example
to understand it. Consider a coin toss example whereby a coin toss outcome follows a
binomial distribution.

Suppose you have a biased coin and you have to determine the probability
of getting "heads" in any toss of the coin. According to the LLN, if you perform the coin toss
experiment many times, you will be able to find the actual probability of getting heads.

Please note that for an unbiased coin you can use classical probability theory
and get the probability P(head) = Total number of favorable outcomes/Total number
of outcomes = 1/2. But for a biased coin, you have unequal probabilities associated
with the events and hence cannot use the classical approach. We will set up a coin toss
experiment to determine the probability of getting heads in a coin toss.


3.6.1.3 Steps in Simulation with R Code 


Step 1: Assume some value for the binomial distribution parameter, say p=0.60, which we
will be estimating using the Law of Large Numbers.


# Set parameters for a binomial distribution Binomial(n, p)
# n -> no. of tosses
# p -> probability of getting a head
library(data.table)
n <- 100
p <- 0.6


In the previous code snippet, we set the true mean for our experiment; that is, we know
that our population comes from a binomial distribution with p = 0.6. The experiment will
now help us estimate this value as the number of tosses increases.

Step 2: Sample a point from binomial distribution (p). 


# Create a data table with n = 100 values sampled from Binomial(1, p)
set.seed(917)
dt <- data.table(binomial = rbinom(n, 1, p), count_of_heads = 0, mean = 0)

# Set the first observation in the data table
ifelse(dt$binomial[1] == 1, dt[1, 2:3] <- 1, 0)
[1] 1


We are using the built-in function rbinom() to sample a binomially distributed random
variable with parameter p = 0.6. This probability value is chosen so that the coin is
biased. If the coin were not biased, the probability of heads would be 0.5.

Step 3: Calculate the probability of heads as the number of heads/total no. of coin tosses.


# Run the experiment a large number of times (up to n) and see how the
# average number of heads (the estimated probability of heads) converges
for (i in 2:n) {
  dt$count_of_heads[i] <- ifelse(dt$binomial[i] == 1,
                                 dt$count_of_heads[i - 1] + 1,
                                 dt$count_of_heads[i - 1])
  dt$mean[i] <- dt$count_of_heads[i] / i
}


At each step, we determine if the outcome is heads or tails. Then, we count the
number of heads so far and divide by the number of trials to get an estimated proportion
of heads. When you run the same experiment a large number of times, the LLN states that
this proportion converges to the probability (expectation or mean) of getting heads in an
experiment. For example, at trial 30, we count how many heads have occurred so far and
divide by 30 to get the average number of heads.
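
The running proportion described above can also be computed without an explicit loop. The following is a minimal sketch, assuming the same n, p, and seed as in the steps above; the names tosses, running_heads, and running_mean are illustrative and not part of the original code.

```r
# Vectorized sketch of the running proportion of heads (illustrative names)
set.seed(917)
n <- 100                                    # number of tosses
p <- 0.6                                    # assumed true probability of heads
tosses <- rbinom(n, 1, p)                   # 1 = heads, 0 = tails
running_heads <- cumsum(tosses)             # heads observed up to each toss
running_mean <- running_heads / seq_len(n)  # proportion of heads so far
tail(running_mean, 1)                       # estimate after all n tosses
```

The last element of running_mean equals the overall proportion of heads, which is the quantity the LLN says approaches p as n grows.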

Step 4: Plot and see how the average over the sample is converging to p = 0.60.


# Plot the average no. of heads -> probability of heads at each experiment stage
plot(dt$mean, type = "l", main = "Simulation of average no. of heads",
     xlab = "Size of Sample", ylab = "Sample mean of no. of Heads")
abline(h = p, col = "red")


Figure 3-3 shows that as the number of experiments increases, the estimated probability
converges to the true probability of heads (0.6). You are encouraged to run the
experiment with a larger n to see the convergence more clearly. This theorem helps us
estimate unknown probabilities by experiment and build the distributions used for
inference testing.


Figure 3-3. Simulation of the coin toss experiment (plot title: "Simulation of average no. of heads"; x-axis: Size of Sample; y-axis: Sample mean of no. of Heads)


3.6.2 Central Limit Theorem


The Central Limit Theorem is another very important theorem in probability theory,
which allows hypothesis testing using the sampling distribution. In simpler words,
the Central Limit Theorem states that sample averages of a large number of
independent random variables, each with a well-defined mean and variance, are
approximately normally distributed.

The first written explanation of this concept was provided by de Moivre in the early
18th century, when he used the normal distribution to approximate the number of heads
from the tossing experiment of a fair coin. Pierre-Simon Laplace published Théorie
analytique des probabilités in 1812, where he expanded the idea of de Moivre by
approximating the binomial distribution with the normal distribution. The precise
proof of the CLT was provided by Aleksandr Lyapunov in 1901, when he defined it in
general terms and proved precisely how it worked mathematically. In probability, this is
one of the most popular theorems, along with the Law of Large Numbers.

In the context of this book, we will mathematically state by far the most popular version
of the Central Limit Theorem (Lindeberg-Levy CLT).

For a sequence of i.i.d. random variables $\{X_1, X_2, \ldots\}$ with a well-defined
expectation and variance ($E[X_i] = \mu$ and $\mathrm{Var}[X_i] = \sigma^2 < \infty$), as
$n$ tends to infinity, $\sqrt{n}(S_n - \mu)$ converges in distribution to a normal
$N(0, \sigma^2)$, where $S_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample average:

$$\sqrt{n}\left(\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) - \mu\right) \xrightarrow{d} N(0, \sigma^2)$$


There are other versions of this theorem, such as Lyapunov CLT, Lindeberg CLT, 
Martingale difference CLT, and many more. It is important to understand how the Law 
of Large Numbers and the Central Limit Theorem tie together in our sampling context. 
The Law of Large Numbers states that the sample mean converges to the population 
mean as the sample size grows, but it does not talk about distribution of sample means. 
The Central Limit Theorem provides us with the insight into the distribution around 
mean and states that it converges to a normal distribution for large number of trials. 
Knowing the distribution then allows us to do inferential testing, as we are able to create 
confidence bounds for a normal distribution. 
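
As a small illustration of such confidence bounds, the sketch below builds an approximate 95% interval for a population mean from a single sample, relying on the normal approximation that the CLT provides. The sample size of 400, the rate 0.6, and the variable names are illustrative assumptions.

```r
# Approximate 95% confidence interval for a mean via the normal approximation
set.seed(2024)
x <- rexp(400, rate = 0.6)            # one sample of size 400 (illustrative)
x_bar <- mean(x)                      # sample mean
std_err <- sd(x) / sqrt(length(x))    # estimated standard error of the mean
z <- qnorm(0.975)                     # about 1.96 for a 95% interval
ci <- c(lower = x_bar - z * std_err, upper = x_bar + z * std_err)
ci
```

For an exponential population with rate 0.6, the true mean is 1/0.6, so an interval built this way would be expected to cover roughly 1.67 in about 95% of repeated samples.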

We will again set up a simple example to explain the theorem: we will sample from an
exponential distribution and show the distribution of the sample mean.


3.6.2.1 Steps in Simulation with R Code 


Step 1: Set the number of samples (say r = 5000) to draw from the population.


# Number of samples
r <- 5000

# Size of each sample
n <- 10000


In the previous code, r represents the number of samples to draw, and n
represents the number of units in each sample. As per the CLT, the larger the size n of
each sample, the closer the distribution of the sample mean is to a normal distribution;
a larger number of samples r gives a smoother picture of that distribution.

Step 2: Start sampling by drawing samples of size n (say n = 10000 each). You can draw
samples from normal, uniform, gamma, and other distributions to test the theorem for
different distributions. (The Cauchy distribution is a useful counterexample: it has no
finite mean or variance, so the CLT does not apply to it.) Here, we take an exponential
distribution with parameter λ = 0.6.


# Produce a matrix of observations with n columns and r rows.
# Each row is one sample.
lambda <- 0.6
Exponential_Samples <- matrix(rexp(n * r, lambda), r)


Now, the Exponential_Samples matrix contains the series of i.i.d. samples drawn
from the exponential distribution with the parameter λ = 0.6.


Step 3: Calculate the sum, mean, and variance of each sample.


all.sample.sums <- apply(Exponential_Samples, 1, sum)
all.sample.means <- apply(Exponential_Samples, 1, mean)
all.sample.vars <- apply(Exponential_Samples, 1, var)


The previous step calculated the sum, mean, and variance of each of the i.i.d. samples.
In the next step, we will observe the distribution of the sums, means, and variances. As
per the CLT, we will observe that the mean follows a normal distribution.
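
For an exponential distribution with rate λ, the mean is 1/λ and the variance is 1/λ², so the CLT suggests that the sample means should be centered near 1/λ with a variance close to 1/(λ²n). The following is a minimal sketch of that check; r_small and n_small are illustrative sizes chosen to keep the run fast, not the values used in the steps above.

```r
# Compare simulated and theoretical moments of the sample mean (sketch)
set.seed(1)
lambda <- 0.6
r_small <- 2000   # number of samples (illustrative)
n_small <- 500    # size of each sample (illustrative)
samples <- matrix(rexp(n_small * r_small, lambda), nrow = r_small)
sample_means <- rowMeans(samples)     # one mean per row (sample)

theoretical_mean <- 1 / lambda
theoretical_var <- 1 / (lambda^2 * n_small)

c(simulated = mean(sample_means), theoretical = theoretical_mean)
c(simulated = var(sample_means), theoretical = theoretical_var)
```

The simulated values should land close to the theoretical ones, and the agreement tightens as r_small grows.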

Step 4: Plot the combined sum, means, and variances. 


par(mfrow = c(2, 2))
hist(Exponential_Samples[1, ], col = "gray", main = "Distribution of One Sample")
hist(all.sample.sums, col = "gray", main = "Sampling Distribution of the Sum")
hist(all.sample.means, col = "gray", main = "Sampling Distribution of the Mean")
hist(all.sample.vars, col = "gray", main = "Sampling Distribution of the Variance")


Figure 3-4 shows the plots of one exponential sample and of the sum, mean, and
variance of all r samples.

Figure 3-4. Sampling distribution plots


Figure 3-4 shows the distribution of the statistics of the samples, i.e., the sum, mean,
and variance. The first plot shows the histogram of the first sample from the exponential
distribution; you can see that the distribution of units within a sample is exponential.
Visual inspection shows that the statistics estimated from the i.i.d. samples follow a
distribution close to a normal distribution.

Step 5: Repeat this experiment with other distributions and see that the results are
consistent with the CLT for distributions that have a finite mean and variance.

There are some other standard distributions that can be used to validate the results
of the Central Limit Theorem. Our example just discussed the exponential distribution;
you are encouraged to use the following distributions to validate the Central Limit Theorem.


# param1 and param2 are placeholders for the parameters of each distribution
Normal_Samples <- matrix(rnorm(n * r, param1, param2), r)
Uniform_Samples <- matrix(runif(n * r, param1, param2), r)
Poisson_Samples <- matrix(rpois(n * r, param1), r)
Cauchy_Samples <- matrix(rcauchy(n * r, param1, param2), r)
Binomial_Samples <- matrix(rbinom(n * r, param1, param2), r)
Gamma_Samples <- matrix(rgamma(n * r, param1, param2), r)
ChiSqr_Samples <- matrix(rchisq(n * r, param1), r)
StudentT_Samples <- matrix(rt(n * r, param1), r)
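
One caveat when trying this list: the CLT requires a finite mean and variance, which the Cauchy distribution lacks, so its sample means are not expected to concentrate as n grows. The sketch below contrasts the spread of sample means for exponential and Cauchy draws; the helper name spread_of_means and the sizes are illustrative assumptions.

```r
# Spread (interquartile range) of r sample means, each from a sample of size n
set.seed(42)
spread_of_means <- function(rdist, n, r = 2000) {
  IQR(rowMeans(matrix(rdist(n * r), nrow = r)))
}
exp_small <- spread_of_means(function(m) rexp(m, 0.6), n = 10)
exp_large <- spread_of_means(function(m) rexp(m, 0.6), n = 1000)
cauchy_small <- spread_of_means(rcauchy, n = 10)
cauchy_large <- spread_of_means(rcauchy, n = 1000)
# Exponential: the spread shrinks with n; Cauchy: the spread does not shrink
c(exp_small, exp_large, cauchy_small, cauchy_large)
```

The mean of n standard Cauchy draws is itself standard Cauchy, which is why averaging does not tighten its distribution.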

It is good practice not to rely on visual inspection alone and to perform formal tests to
infer any properties of a distribution. A histogram together with a formal test of
normality is a good way to establish, both visually and by formal testing, that the
distribution of means is actually normal (as claimed by the CLT).

Next, we perform a Shapiro-Wilk test to test for normality of the distribution of means.
Other normality tests are discussed briefly in Chapter 6. One of the most popular
non-parametric normality tests is the KS one-sample test, which is discussed in Chapter 7.


# Do a formal test of normality on the distribution of sample means
Mean_of_sample_means <- mean(all.sample.means)
Variance_of_sample_means <- var(all.sample.means)

# Test normality with the Shapiro-Wilk test
shapiro.test(all.sample.means)


Shapiro-Wilk normality test 


data: all.sample.means 
W = 0.99979, p-value = 0.9263 


You can see that the p-value from the Shapiro-Wilk test is not significant (> 0.05), which
means that we can’t reject the null hypothesis that the distribution is normal. The
distribution of sample means is consistent with a normal distribution with mean = 1.66 and
variance = 0.00027, close to the theoretical values 1/λ ≈ 1.67 and 1/(λ²n) ≈ 0.00028.

Visual inspection can be done by plotting histograms. Additionally, for clarity, let’s 
superimpose the normal density function on the histogram to confirm if the distribution 
is normally distributed. 


x <- all.sample.means
h <- hist(x, breaks = 20, col = "red", xlab = "Sample Means",
          main = "Histogram with Normal Curve")
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = Mean_of_sample_means, sd = sqrt(Variance_of_sample_means))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)


Figure 3-5. Distribution of sample means with normal density lines


The most important points to remember about the Law of Large Numbers and CLT are:


• As the sample size grows large, you can expect a better estimate of
  the population/model parameters. In other words, a large sample
  size will provide you with more accurate estimates for hypothesis testing.


• The Central Limit Theorem helps you get a distribution and hence
  allows you to get a confidence interval around parameters and
  apply inference testing. The important thing is that the CLT doesn’t
  assume any particular distribution for the population from which
  samples are drawn (beyond a finite mean and variance), which frees
  you from distribution assumptions.


3.7 Probability Sampling Techniques 


In this section, we introduce some of the popular probability sampling techniques and
show how to perform them using R. All the sampling techniques are explained using our
credit card fraud data. As a first step, we compute the population statistics and
distributions, and then compare the corresponding sample properties with the population
properties to assess the sampling outputs.


3.7.1 Population Statistics 


We will look at some basic features of the data. These features will be called population
statistics/parameters. We will then show different sampling methods and compare the
results with the population statistics.

1. Population data dimensions


str() shows us the column names, types, and a few values in each
column. You can see the dataset is a mix of integers and characters.


str(data)
Classes 'data.table' and 'data.frame': 10000000 obs. of 14 variables:
 $ creditLine : int 1 1 1 1 1 1 1 1 1 1 ...
 $ gender     : int ...
 $ state      : int 1 1 1 1 1 1 1 1 1 1 ...
 $ CustomerID : int 4446 59161 136032 223734 240467 248899 262655 324670 390138 482698 ...
 $ NumOfCards : int 1 1 1 1 1 1 1 1 1 1 ...
 $ OutsBal    : int 2000 0 2000 2000 2000 0 0 689 2000 0 ...
 $ DomesTransc: int 31 25 78 11 40 47 15 17 48 25 ...
 $ IntTransc  : int 90300009035...
 $ FraudFlag  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ State      : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ PostalCode : chr "AL" "AL" "AL" "AL" ...
 $ Gender     : chr "Male" "Male" "Male" "Male" ...
 $ CardType   : chr "American Express" "American Express" "American Express" "American Express" ...
 $ CardName   : chr "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "creditLine"


2. Population mean for measures


a. Outstanding balance: On average, each card carries an
   outstanding amount of $4109.92.


mean_outstanding_balance <- mean(data$OutsBal)
mean_outstanding_balance
[1] 4109.92


b. Number of international transactions: The average number of
   international transactions is 4.05.


mean_international_trans <- mean(data$IntTransc)
mean_international_trans
[1] 4.04719


c. Number of domestic transactions: The average number
   of domestic transactions is very high compared to
   international transactions; the number is 28.9 ≈ 29
   transactions.


mean_domestic_trans <- mean(data$DomesTransc)
mean_domestic_trans
[1] 28.93519

3. Population variance for measures


a. Outstanding balance:


Var_outstanding_balance <- var(data$OutsBal)
Var_outstanding_balance
[1] 15974788


b. Number of international transactions:


Var_international_trans <- var(data$IntTransc)
Var_international_trans
[1] 74.01109


c. Number of domestic transactions:


Var_domestic_trans <- var(data$DomesTransc)
Var_domestic_trans
[1] 705.1033


4. Histogram


a. Outstanding balance:


hist(data$OutsBal, breaks = 20, col = "red", xlab = "Outstanding Balance",
     main = "Distribution of Outstanding Balance")


Figure 3-6. Histogram of outstanding balance


b. Number of international transactions:


hist(data$IntTransc, breaks = 20, col = "blue", xlab = "Number of International Transactions",
     main = "Distribution of International Transactions")


Figure 3-7. Histogram of number of international transactions


c. Number of domestic transactions:


hist(data$DomesTransc, breaks = 20, col = "green", xlab = "Number of Domestic Transactions",
     main = "Distribution of Domestic Transactions")


Figure 3-8. Histogram of number of domestic transactions


The preceding output and Figures 3-6 through 3-8 show the mean, variance, and
distribution of a few important variables from our credit card fraud dataset. These
population statistics will be compared to sample statistics to see which sampling
techniques provide a representative sample.


3.7.2 Simple Random Sampling 


Simple random sampling is a process of selecting a sample from the population where
each unit of the population is selected at random. Each unit has the same individual
probability of being chosen at any stage during the sampling process, and every subset of
k individuals has the same probability of being chosen for the sample as any other subset
of k individuals.

Simple random sampling is a basic type of sampling, hence it can be a component of more
complex sampling methodologies. In the coming topics, you will see simple random
sampling form an important component of other probability sampling methods, like
stratified sampling and cluster sampling.

Simple random sampling is typically done without replacement, i.e., by the design of the
sampling process, we make sure that no unit can be selected more than once. Simple
random sampling can also be done with replacement, in which case the same unit can
appear in the sample more than once. If you draw a small sample from a large population,
sampling without replacement and sampling with replacement will give approximately
the same results, as the probability of any unit being chosen twice is very small.
Table 3-5 compares the statistics from simple random sampling with and without
replacement. The values are comparable, as the population size is very big (~10 million).
We will see this fact in our example.
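
To see why the two schemes agree when the sample is small relative to the population, it helps to count how many repeated units a with-replacement sample contains. The following is a minimal sketch on a synthetic index population; N_pop and n_samp are illustrative values, and no fraud data is used.

```r
# Count repeated units when sampling with and without replacement (sketch)
set.seed(937)
N_pop <- 1e7      # population size (illustrative)
n_samp <- 1e5     # sample size (illustrative)
idx_wor <- sample(N_pop, size = n_samp, replace = FALSE)
idx_wr <- sample(N_pop, size = n_samp, replace = TRUE)
sum(duplicated(idx_wor))   # always 0 without replacement
sum(duplicated(idx_wr))    # a small count relative to n_samp
# Approximate expected number of repeats: n(n - 1) / (2N)
n_samp * (n_samp - 1) / (2 * N_pop)
```

With these sizes only a fraction of a percent of the with-replacement draws are repeats, which is why the two samples give nearly the same summary statistics. The fraction grows quickly as n_samp approaches N_pop.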

Table 3-5. Comparison Table with Population, Sampling With and Without Replacement


CardType           OutstandingBalance_Population   OutstandingBalance_Random_WOR   OutstandingBalance_Random_WR
American Express   3820.896                        3835.064                        3796.138
Discover           4962.420                        4942.690                        4889.926
MasterCard         3818.300                        3822.632                        3780.691
Visa               4584.042                        4611.649                        4553.196

Advantages:


• It is free from classification error
• Not much advanced knowledge of the population is required
• Easy interpretation of sample data

Disadvantages:


• A complete sampling frame (population) is required to get a
  representative sample


• Data retrieval and storage increase cost and time


• A simple random sample carries the bias and errors present in the
  population, and additional interventions are required to get rid of those

Function: Summarise
summarise() is a function in the dplyr library. This function helps aggregate the data
by dimensions. It works similarly to a pivot table in Excel.


• group_by: This function takes the categorical variable by which
  you want to aggregate the measures.


• mean(OutsBal): This argument gives the aggregating function and
  the field name on which the aggregation is to be done.


# Population data: distribution of outstanding balance across card type
library(dplyr)


summarise(group_by(data, CardType), Population_OutstandingBalance = mean(OutsBal))
Source: local data table [4 x 2]


          CardType Population_OutstandingBalance
1 American Express                      3820.896
2       MasterCard                      3818.300
3             Visa                      4584.042
4         Discover                      4962.420

The call of the summarise function by CardType shows the average outstanding
balance by card type. Discover cards have the highest average outstanding balance.
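
The same kind of group-wise average can be tried on a few rows before running it on the full population. The sketch below uses base R's aggregate() as an analogue of the summarise/group_by call, so that it runs without dplyr; the toy values are made up for illustration and are not taken from the fraud dataset.

```r
# Toy data (illustrative values, not the fraud dataset)
toy <- data.frame(
  CardType = c("Visa", "Visa", "Discover", "Discover", "MasterCard"),
  OutsBal = c(1000, 3000, 5000, 7000, 2000)
)
# Mean outstanding balance per card type, analogous to
# summarise(group_by(toy, CardType), mean(OutsBal))
toy_summary <- aggregate(OutsBal ~ CardType, data = toy, FUN = mean)
toy_summary
```

Each row of toy_summary holds one card type and the mean of OutsBal over the rows of that type, which is the same shape of result that the pivot-table analogy describes.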

Next, we draw a random sample of 100,000 records by using a built-in function 
sample() from the base library. This function creates a sampling frame by randomly 
selecting indexes of data. Once we get the sampling frame, we extract the corresponding 
records from the population data. 

Function: Sample

Note some important arguments of the sample() function:


• nrow(data): This gives the size of the data. Here it is 10,000,000, and
  hence sample() will create an index of 1 to 10,000,000 and then randomly
  select indexes for sampling.


• size: Allows users to specify how many data points to sample
  from the population. In our case, we have set size to 100,000.


• replace: This argument allows users to state if the sampling
  should be done without replacement (FALSE) or with
  replacement (TRUE).


• prob: This is a vector of probabilities for obtaining the sampling
  frame. We have set this to NULL, so that all units have the same
  weight/probability.


set.seed(937)
# Simple Random Sampling Without Replacement
library("base")


sample_SOR_100K <- data[sample(nrow(data), size = 100000, replace = FALSE, prob = NULL), ]


Now, let’s again see how the average balance looks for the simple random sample.
As you can see, the order of average balances has been maintained. Note that the averages
are very close to the population averages calculated in the previous step.

# Sample data: distribution of outstanding balance across card type
library(dplyr)
summarise(group_by(sample_SOR_100K, CardType), Sample_OutstandingBalance = mean(OutsBal))
Source: local data table [4 x 2]


          CardType Sample_OutstandingBalance
1       MasterCard                  3822.632
2 American Express                  3835.064
3             Visa                  4611.649
4         Discover                  4942.690

Function: ks.test()


This is one of the non-parametric tests for comparing the empirical distribution
functions of data. This function helps determine if the data points are coming from
the same distribution or not. It can be done as a one-sample test, i.e., the Empirical
Distribution Function (EDF) of the data is compared to the CDF of some preset
distribution (normal, Cauchy, etc.), or as a two-sample test, i.e., when we want to see if
the distribution of two samples is the same or not.

As one of the important features of sampling is to make sure the distribution of the data
does not change after sampling (except when it is done intentionally), we will use the
two-sample test to see if the sample is a true representation of the population by
checking if the population and sample are drawn from the same distribution.

The important arguments are the two data series and the hypothesis to test (two-tailed or
one-tailed). We are choosing the more conservative two-tailed test for this example. The
two-tailed test takes equality of the two distributions as the null hypothesis and rejects
it for a difference in either direction.


# Test whether the sampled data comes from the population distribution.
# This makes sure that sampling does not change the original distribution.
ks.test(data$OutsBal, sample_SOR_100K$OutsBal, alternative = "two.sided")

Two-sample Kolmogorov-Smirnov test


data: data$OutsBal and sample_SOR_100K$OutsBal
D = 0.003042, p-value = 0.3188
alternative hypothesis: two-sided


The KS test gives a p-value of 0.3188, so we cannot reject the null hypothesis that the
sample and the population have the same distribution. Hence, we can say that the sampling
has not materially changed the distribution. By the nature of sampling, simple random
sampling without replacement retains the distribution of the data.
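
The two-sample usage can be tried on synthetic data before running it on 10 million rows. The following is a minimal sketch, assuming a simulated population (pop) and two hypothetical samples: a simple random sample (srs) and a deliberately distorted one (biased) that keeps only the largest values.

```r
# Two-sample KS test on synthetic data (illustrative, not the fraud dataset)
set.seed(101)
pop <- rgamma(200000, shape = 2, scale = 2000)   # synthetic "balances"
srs <- sample(pop, 5000)                         # simple random sample
biased <- sort(pop, decreasing = TRUE)[1:5000]   # top values only
ks_srs <- ks.test(pop, srs, alternative = "two.sided")
ks_biased <- ks.test(pop, biased, alternative = "two.sided")
c(srs = ks_srs$p.value, biased = ks_biased$p.value)
```

The distorted sample produces a D statistic close to 1 and a p-value near zero, while the simple random sample typically yields a small D and a p-value that does not lead to rejection. R may warn that the p-value is approximate when the two series share values.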

For visual inspection, Figure 3-9 shows the histograms for the population and the 
sample. As you can see, the distribution is the same for both. 


par(mfrow = c(1, 2))
hist(data$OutsBal, breaks = 20, col = "red", xlab = "Outstanding Balance",
     main = "Histogram for Population Data")
hist(sample_SOR_100K$OutsBal, breaks = 20, col = "green", xlab = "Outstanding Balance",
     main = "Histogram for Sample Data (without replacement)")


Figure 3-9. Population versus sample (without replacement) distribution


Now we will do a formal test on the mean of the outstanding balance from the
population and our random sample. Theoretically, we expect the t-test on the two means
not to reject the null hypothesis of equal means, in which case we can say that, at the
95% confidence level, there is no evidence that the means of the sample and the
population differ.


# Let's also do a t-test for the mean of the population and the sample.
t.test(data$OutsBal, sample_SOR_100K$OutsBal)

Welch Two Sample t-test


data: data$OutsBal and sample_SOR_100K$OutsBal
t = -0.85292, df = 102020, p-value = 0.3937
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35.67498  14.04050
sample estimates:
mean of x mean of y
 4109.920  4120.737


These results show no significant difference between the means of the population and
the sample, as the p-value of the t-test is insignificant. We cannot reject the null
hypothesis that the means are equal.

Here we will show you similar testing performed for a simple random sample with
replacement. As you will see, we don’t see any significant change in the results as
compared to the simple random sample without replacement, as the population size is very
big and replacement essentially doesn’t alter the sampling probability of a record in any
material way.


set.seed(937)
# Simple Random Sampling With Replacement
library("base")


sample_SR_100K <- data[sample(nrow(data), size = 100000, replace = TRUE, prob = NULL), ]


In this code, for simple random sampling with replacement, we set replace to TRUE in
the sample function call.

The following code shows how we performed the KS test on the distributions of the
population and the sample drawn with replacement. The p-value is insignificant, so the
test fails to reject the null hypothesis of equal distributions.


ks.test(data$OutsBal, sample_SR_100K$OutsBal, alternative = "two.sided")
Warning in ks.test(data$OutsBal, sample_SR_100K$OutsBal, alternative =
"two.sided"): p-value will be approximate in the presence of ties


Two-sample Kolmogorov-Smirnov test


data: data$OutsBal and sample_SR_100K$OutsBal
D = 0.0037522, p-value = 0.1231
alternative hypothesis: two-sided


We create the histograms of the population and the sample drawn with replacement. The
plots look the same; coupled with the formal KS test, we see that both the sample with
replacement and the population have the same distribution. Be cautious about this
conclusion when the population size is small.


par(mfrow = c(1, 2))
hist(data$OutsBal, breaks = 20, col = "red", xlab = "Outstanding Balance",
     main = "Population")
hist(sample_SR_100K$OutsBal, breaks = 20, col = "green", xlab = "Outstanding Balance",
     main = "Random Sample Data (WR)")


Figure 3-10. Population versus sample (with replacement) distribution


The distribution is similar for population and random sample drawn with 
replacement. We summarize the simple random sampling by comparing the summary 
results from population, simple random sample without replacement, and simple 
random sample with replacement. 


population_summary <- summarise(group_by(data, CardType),
                                OutstandingBalance_Population = mean(OutsBal))

random_WOR_summary <- summarise(group_by(sample_SOR_100K, CardType),
                                OutstandingBalance_Random_WOR = mean(OutsBal))

random_WR_summary <- summarise(group_by(sample_SR_100K, CardType),
                               OutstandingBalance_Random_WR = mean(OutsBal))

compare_population_WOR <- merge(population_summary, random_WOR_summary, by = "CardType")

compare_population_WR <- merge(population_summary, random_WR_summary, by = "CardType")

summary_compare <- cbind(compare_population_WOR,
                         compare_population_WR[, OutstandingBalance_Random_WR])

colnames(summary_compare)[which(names(summary_compare) == "V2")] <- "OutstandingBalance_Random_WR"


knitr::kable(summary_compare)

Table 3-5 shows that both with and without replacement, simple random sampling
gave similar values of the mean across card types, which were very close to the true mean
of the population.

Key points:


• Simple random sampling gives representative samples from the
  population.


• Sampling with and without replacement can give different results
  with different sample sizes, so extra care should be taken in
  choosing the method when the population size is small.


• The appropriate sample size for each problem differs based on
  the confidence we want in our testing, business purposes,
  cost-benefit analysis, and other reasons. Once you have a good
  understanding of what is happening in each sampling technique,
  you can choose the best one for the problem at hand.


3.7.3 Systematic Random Sampling 


Systematic sampling is a statistical method in which the units are selected with a
systematically ordered sampling frame. The most popular form of systematic sampling is
based on a circular sampling frame, where we traverse the population from start to end
and then continue again from the start in a circular manner. In this approach, the
probability of each unit being selected is the same, and hence this approach is sometimes
called the equal-probability method. But you can create other systematic frames according
to your needs to perform systematic sampling.

In this section, we discuss the most popular circular approach to systematic random
sampling. In this method, sampling starts by selecting a unit from the population at
random, and then every kth element is selected. When the list ends, the sampling starts
again from the beginning. Here, k is known as the skip factor, and it’s calculated as
follows:

$$k = \frac{N}{n}$$

where N is the population size and n is the sample size.

This approach makes systematic sampling functionally similar to simple
random sampling, but it is not the same because not every possible sample of a certain
size has an equal probability of being chosen (e.g., the skip factor ensures that
adjacent elements are never selected together in the sampling frame). However, this
method is efficient if the variance within the systematic sample is more than the
population variance.

Advantages:


• Easy to implement


• Can be more efficient

Disadvantages:

• Can be applied only when the population is logically homogeneous


• There can be a hidden pattern in the sampling frame, causing
  unwanted bias

Let’s create an example of systematic sampling from our credit card fraud data. 

Step 1: Identify a subset of population that can be assumed to be homogeneous. A 
possible option is to subset the population by state. In this example, we use Rhode Island, 
the smallest state in the United States, to assume homogeneity. 


Note Creating homogeneous sets from the population by some attribute is discussed 
in Chapter 6. 


For illustration purposes, let’s create a homogeneous set by subsetting the
population with the following business logic: subset the data and pull the records whose
international transactions equal 0 and domestic transactions are less than or equal to 3.

We assume that this subset forms a homogeneous population. The assumption is at least
partially reasonable, as customers who rarely use their cards domestically are likely not
to use them internationally at all.


Data_Subset <- subset(data, IntTransc == 0 & DomesTransc <= 3)
summarise(group_by(Data_Subset, CardType), OutstandingBalance = mean(OutsBal))
Source: local data table [4 x 2]


          CardType OutstandingBalance
1 American Express           3827.894
2       MasterCard           3806.849
3             Visa           4578.604
4         Discover           4924.235


Assuming the subset has homogeneous sets of cardholders by card type, we can
go ahead with systematic sampling. If the set is not homogeneous, then it’s highly
likely that systematic sampling will give a biased sample and hence not provide a true
representation of the population. Further, we know that the data is stored in an R data
frame by an internal index (which will be the same as our customer ID), so we can rely on
the internally ordered index for systematic sampling.

Step 2: Set a sample size to sample from the population. 


# Size of population (here, the number of cardholders in Data_Subset)
Population_Size_N <- length(Data_Subset$OutsBal)


# Set the size of the sample to pull (should be less than N).
# We will assume n = 5000.
Sample_Size_n <- 5000

Step 3: Calculate the skip factor using this formula:

$$k = \frac{N}{n}$$

The skip factor gives the jump size used while creating the systematic sampling
frame. Essentially, with a seed (or starting index) of c, items will be selected after
skipping k items in order.


# Calculate the skip factor
k <- ceiling(Population_Size_N / Sample_Size_n)


# ceiling(x) returns the smallest integer that is not less than x.
# This means ceiling(2.3) = 3


cat("The skip factor for systematic sampling is ", k)
The skip factor for systematic sampling is 62


Step 4: Set a random seed value of the index and then create a sequence vector with the
seed and skip (the sample frame). This takes a seed value of the index, say i, and then
creates a sampling frame as i, i+k, i+2k, and so on, until it has a total of n (sample
size) indexes in the sampling frame.


r <- sample(1:k, 1)
systematic_sample_index <- seq(r, r + k * (Sample_Size_n - 1), k)


Step 5: Sample the records from the population using the sampling frame. Once the
sampling frame is ready, it is nothing but a list of indices, so we pull the data records
corresponding to those indices.


systematic_sample_5K <- Data_Subset[systematic_sample_index, ]
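
Steps 2 to 5 can be tried end to end on a small toy population. The following is only a
sketch: the data frame `population` and its columns are made up for illustration. It also
makes one deliberate change, labeled as an assumption: it uses floor() rather than
ceiling() for the skip factor. With ceiling(), the last indices r + k(n-1) can fall
beyond N whenever N/n is not a whole number, which would yield empty (NA) rows.

```r
# Toy population: 1,000 hypothetical records ordered by an ID column
set.seed(937)
population <- data.frame(id = 1:1000, balance = round(runif(1000, 0, 5000)))

N <- nrow(population)   # population size
n <- 50                 # desired sample size

# Assumption: floor() instead of ceiling(), which guarantees r + k * (n - 1) <= N
k <- floor(N / n)

# Random start in 1..k, then every k-th record
r   <- sample(1:k, 1)
idx <- seq(r, r + k * (n - 1), by = k)

systematic_sample <- population[idx, ]

nrow(systematic_sample)   # number of sampled records
max(idx) <= N             # checks that no index falls outside the population
```

The trade-off is that with floor() the records after position k*n can never be selected, so
for real work you may prefer to keep ceiling() and drop any indices larger than N.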


Let’s now compare the systematic sample with a simple random sample of the same
size, 5000. From the previous discussion, we know that a simple random sample is a
true representation of the population, so we can use it as a proxy for population
properties.


set.seed(937)
# Simple random sampling without replacement
library("base")

sample_Random_5K <- Data_Subset[sample(nrow(Data_Subset), size = 5000, replace = FALSE, prob = NULL), ]


Here is the summary comparison by card type for outstanding balances. The comparison
is important because it shows what differences in the means would appear had we
chosen a simple random sample instead of a systematic sample.


sys_summary <- summarise(group_by(systematic_sample_5K, CardType), OutstandingBalance_Sys = mean(OutsBal))

random_summary <- summarise(group_by(sample_Random_5K, CardType), OutstandingBalance_Random = mean(OutsBal))

summary_mean_compare <- merge(sys_summary, random_summary, by = "CardType")

print(summary_mean_compare)
Source: local data table [4 x 3]

          CardType OutstandingBalance_Sys OutstandingBalance_Random
1 American Express               3745.873                  3733.818
2         Discover               5258.751                  4698.375
3       MasterCard               3766.037                  3842.121
4             Visa               4552.099                  4645.664


Again, we emphasize testing the sample EDF against the population EDF to make
sure that the sampling has not distorted the distribution of the data. This step will be
repeated for all the sampling techniques, as it ensures that the sample is stable for
modeling purposes.


ks.test(Data_Subset$OutsBal, systematic_sample_5K$OutsBal, alternative = "two.sided")

Warning in ks.test(Data_Subset$OutsBal, systematic_sample_5K$OutsBal,
alternative = "two.sided"): p-value will be approximate in the presence of
ties


    Two-sample Kolmogorov-Smirnov test


data:  Data_Subset$OutsBal and systematic_sample_5K$OutsBal
D = 0.010816, p-value = 0.6176
alternative hypothesis: two-sided


The KS test gives no evidence that the two distributions differ (p-value = 0.6176), so
the sample can be treated as representative of the population in distribution. Figure 3-11
shows histograms of the homogeneous data subset and the systematic sample. We
can see that the distribution has not changed drastically.
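
To get a feel for how the KS statistic D reacts to a distorted sample, here is a small
sketch on simulated data. The log-normal "population" and the deliberately biased sample
are assumptions made purely for illustration; they are not taken from the credit card data.

```r
set.seed(937)
# Hypothetical population of balances (log-normal values chosen only for illustration)
population <- rlnorm(100000, meanlog = 8, sdlog = 0.5)

# A simple random sample and a deliberately biased sample (upper half only)
good_sample   <- sample(population, 5000)
biased_sample <- sample(population[population > median(population)], 5000)

ks_good   <- ks.test(population, good_sample)
ks_biased <- ks.test(population, biased_sample)

# D is the largest vertical gap between the two empirical distribution functions
ks_good$statistic     # small: the sample tracks the population EDF
ks_biased$statistic   # about 0.5: the lower half of the population is missing
ks_biased$p.value     # tiny: the hypothesis of equal distributions is rejected
```

A small D with a large p-value, as in the systematic sample above, is what you hope to
see before using a sample for modeling.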


par(mfrow = c(1, 2))
hist(Data_Subset$OutsBal, breaks = 50, col = "red", xlab = "Outstanding Balance",
     main = "Homogenous Subset Data")

hist(systematic_sample_5K$OutsBal, breaks = 50, col = "green", xlab = "Outstanding Balance",
     main = "Systematic Sample ")




CHAPTER 3 = SAMPLING AND RESAMPLING TECHNIQUES 


[Figure: two side-by-side histograms, "Homogenous Subset Data" and "Systematic Sample"; x-axis: Outstanding Balance, y-axis: Frequency]


Figure 3-11. Homogeneous population and systematic sample distribution 


Key points:


• Systematic sampling behaves like simple random sampling
if done on a homogeneous set of data points. Also, a large
population size suppresses the bias associated with systematic
sampling for smaller sampling fractions.


• Business and computational capacity are important criteria when
choosing a sampling technique for a large population.
In our example, systematic sampling gives a representative
sample at a lower computational cost. (Apart from drawing the
starting index, there is no call to the random number generator,
and hence no need to traverse the complete list of records.)


3.7.4 Stratified Random Sampling 


When the population has subpopulations that vary, it is important for the sampling
technique to consider the variation at the subpopulation (stratum) level and to sample
independently within each stratum. Stratification is the process of identifying
homogeneous groups characterized by some intrinsic property. For instance,
customers living in the same city can be thought of as belonging to that city’s stratum.
The strata should be mutually exclusive and collectively exhaustive, i.e., every unit of the
population should be assigned to some stratum and a unit can belong to only one stratum.
Once the strata are formed, simple random sampling or systematic sampling
is performed within each stratum independently. This improves the representativeness
of the sample and generally reduces the sampling error. Dividing the population into strata
also helps you calculate a weighted average of the population, which has less variability
than an estimate from the total population combined.
There are two generally accepted methods of determining the stratified sample size:


• Proportional allocation, which samples the same proportion of the
data from each stratum. In this case, the same sampling fraction
is applied to all the strata in the population. For instance, your
population has four types of credit cards and you assume that
each credit card type forms a homogeneous group of customers.
If the numbers of customers in the strata are
$N_1 + N_2 + N_3 + N_4 = \text{total}$, then proportional allocation gives
a sample with the same proportion from each stratum
($n_1/N_1 = n_2/N_2 = n_3/N_3 = n_4/N_4 = \text{sampling fraction}$).


• Optimal allocation, which samples from each stratum in
proportion to the standard deviation of the distribution of the
stratum variable. This results in larger samples from the strata with the
highest variability, which reduces the sample variance.
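
A short worked example may help compare the two allocation rules. The stratum sizes and
standard deviations below are invented for illustration. For optimal allocation, the
sketch assumes the common Neyman form, in which the stratum sample size is proportional
to the stratum size times its standard deviation; the text above only states
proportionality to the standard deviation.

```r
# Hypothetical strata: sizes and standard deviations are made up for illustration
strata <- data.frame(
  card_type = c("American Express", "Discover", "MasterCard", "Visa"),
  N_h       = c(2500, 600, 4000, 2900),   # stratum sizes
  sd_h      = c(1200, 2500, 1100, 1800)   # stratum standard deviations
)
n <- 1000  # total sample size

# Proportional allocation: n_h = n * N_h / N (same sampling fraction everywhere)
strata$n_prop <- round(n * strata$N_h / sum(strata$N_h))

# Neyman allocation (assumed form): n_h = n * N_h * sd_h / sum(N_h * sd_h)
w <- strata$N_h * strata$sd_h
strata$n_neyman <- round(n * w / sum(w))

strata
```

In this toy setup the small but highly variable Discover stratum receives 60 units under
proportional allocation and 106 under Neyman allocation, at the expense of the less
variable strata.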


Another important feature of stratified sampling is that it makes sure that at least one
unit is sampled from each stratum, even if the probability of it being selected under
simple random sampling is close to zero. It is recommended to limit the number of strata
and make sure enough units are present in each stratum to do the sampling.

Advantages:


• Greater precision than simple random sampling of the same
sample size


• Due to higher precision, it is possible to work with small samples
and hence reduce cost


• Avoids unrepresentative samples, as this method samples at least
one unit from each stratum


Disadvantages:


• Not always possible to partition the population into disjoint groups


• Overhead of identifying homogeneous strata before sampling,
adding to administrative cost


• Thin strata can limit the representative sample size


To construct an example of stratified sampling with the credit card fraud data, we first
have to check the strata and then go ahead with sampling from each stratum. For our
example, we will create strata based on the CardType and State variables.

Here, we explain step by step how to go about performing stratified sampling.

Step 1: Check the stratum variables and their frequency in the population.

Let’s assume CardType and State are the stratum variables. In other words, we believe
the type of card and the state can be used as criteria to stratify the customers into logical
buckets. Here are the frequencies by our stratum variables. We expect stratified sampling
to maintain the same proportion of records in the stratified sample.




# Frequency table for CardType in population
table(data$CardType)

American Express         Discover       MasterCard             Visa
         2474848           642531          4042704          2839917

# Frequency table for State in population
table(data$State)

       Alabama American Samoa        Arizona       Arkansas     California
         20137         162574         101740         202776        1216069
      Colorado    Connecticut       Delaware        Florida        Georgia
        171774         121802          20603          30333         608630
          Guam         Hawaii          Idaho       Illinois        Indiana
        303984          50438         111775          60992         404720
          Iowa         Kansas       Kentucky      Louisiana          Maine
        203143          91127         142170         151715         201918
      Maryland  Massachusetts       Michigan      Minnesota    Mississippi
        202444          40819         304553         182201         203045
      Missouri        Montana       Nebraska         Nevada  New Hampshire
        101829          30131          60617         303833          20215
    New Jersey     New Mexico       New York North Carolina   North Dakota
         40563         284428          81332          91326         608575
          Ohio       Oklahoma         Oregon   Pennsylvania   Rhode Island
        364531         122191         121846         405892          30233
South Carolina   South Dakota      Tennessee          Texas           Utah
        152253          20449         203827         812638          91375
       Vermont       Virginia     Washington  West Virginia      Wisconsin
        252812          20017         202972         182557          61385
       Wyoming
         20691


The cross table breaks the whole population by the stratum variables, CardType and 
State. Each stratum represents a set of customers having similar behaviors as they come 
from the same stratum. The following output is trimmed for easy readability. 


# Cross table frequency for population data
table(data$State, data$CardType)

                 American Express Discover MasterCard   Visa
  Alabama                    4983     1353       8072   5729
  American Samoa            40144    10602      65740  46088
  Arizona                   25010     6471      41111  29148
  Arkansas                  50158    12977      82042  57599
  California               301183    78154     491187 345545
  Colorado                  42333    11194      69312  48935
  Connecticut               30262     7942      49258  34340
  Delaware                   4990     1322       8427   5864


Step 2: Draw a random sample without replacement from each stratum, sampling
10% of the size of the stratum.

We are choosing the most popular form of stratified sampling, proportional sampling.
We will be sampling 10% of the records from each stratum.

Function: stratified()

The stratified function samples from a data.frame or a data.table in which one
or more columns can be used as a “stratification” or “grouping” variable. The result is a
new data.table with the specified number of samples from each group. The standard
function syntax is shown here:


stratified(indt, group, size, select = NULL, replace = FALSE,
           keep.rownames = FALSE, bothSets = FALSE, ...)


• group: This argument lets users define the stratum variables.
Here we have chosen CardType and State as the stratum variables,
so in total we will have 4 (card types) × 52 (states) strata to
sample from.


• size: In general, size can be passed as a number (the same number of units
is sampled from each stratum) or as a sampling fraction. We will use a
sampling fraction of 0.1. For other options, type ?stratified in the
console.


• replace: This lets you choose sampling with or without
replacement. We have set it to FALSE, which means sampling
without replacement.


We will be using this function to perform stratified random sampling.
We can also do stratified sampling with the standard sample() function, using the
following steps:


1. Create subsets of the data by the stratum variables.


2. Calculate the sample size for a sampling fraction of 0.1, for each
stratum.


3. Do simple random sampling from each stratum for the
sample size as calculated.


The results of these steps and the stratified() results will be equivalent: both apply
the same sampling fraction to every stratum, although the individual records drawn will
differ. The stratified() function will, however, be faster to execute. You are encouraged
to implement this algorithm and try out the other functions.
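
The three manual steps can be sketched with base R alone. The toy data frame and the
names `rows_by_stratum`, `sampled_rows`, and `manual_stratified` are hypothetical. The
rounding of the per-stratum sample size is also a choice made here for illustration and
is not claimed to match what stratified() does internally.

```r
set.seed(937)
# Hypothetical toy data: 1,000 customers with a card type
toy <- data.frame(
  id       = 1:1000,
  CardType = sample(c("American Express", "Discover", "MasterCard", "Visa"),
                    1000, replace = TRUE, prob = c(0.25, 0.05, 0.40, 0.30))
)

# Step 1: split the row indices by stratum
rows_by_stratum <- split(seq_len(nrow(toy)), toy$CardType)

# Steps 2 and 3: for each stratum, compute 10% of its size and draw that many rows
sampled_rows <- unlist(lapply(rows_by_stratum, function(rows) {
  n_h <- round(0.1 * length(rows))
  rows[sample.int(length(rows), n_h)]   # sampling without replacement
}), use.names = FALSE)

manual_stratified <- toy[sampled_rows, ]

# Stratum shares in the sample should be close to those in the population
round(prop.table(table(toy$CardType)), 3)
round(prop.table(table(manual_stratified$CardType)), 3)
```

Indexing with sample.int() instead of calling sample(rows, n_h) avoids a known pitfall of
sample(): when a stratum contains a single row, sample() would treat that row number as
the upper bound of a range.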


set.seed(937)

# We want to make sure that our sampling retains the same proportion of the
# card types in the sample

# Choose a random sample without replacement from each stratum, consisting
# of 10% of the total size of the stratum

library(splitstackshape)

stratified_sample_10_percent <- stratified(data, group = c("CardType", "State"), size = 0.1, replace = FALSE)


Step 3: Check whether the proportions of data points in the sample are the same as in the
population.

Here is the output of the stratified sample by CardType, by State, and by cross tabulation.
The values show that the sampling has been done across the strata with the same
proportion. For example, the number of Alabama and American Express records in the
population is 4983; in the stratified sample the number of Alabama and American Express
cardholders is 1/10 of that, i.e., 498. For all other strata, the proportion is the same.


# Frequency table for CardType in sample
table(stratified_sample_10_percent$CardType)

American Express         Discover       MasterCard             Visa
          247483            64250           404268           283988

# Frequency table for State in sample
table(stratified_sample_10_percent$State)

       Alabama American Samoa        Arizona       Arkansas     California
          2013          16257          10174          20278         121606
      Colorado    Connecticut       Delaware        Florida        Georgia
         17177          12180           2060           3032          60862
          Guam         Hawaii          Idaho       Illinois        Indiana
         30399           5044          11177           6099          40471
          Iowa         Kansas       Kentucky      Louisiana          Maine
         20315           9113          14218          15172          20191
      Maryland  Massachusetts       Michigan      Minnesota    Mississippi
         20245           4081          30455          18220          20305
      Missouri        Montana       Nebraska         Nevada  New Hampshire
         10183           3013           6061          30383           2022
    New Jersey     New Mexico       New York North Carolina   North Dakota
          4056          28442           8133           9132          60856
          Ohio       Oklahoma         Oregon   Pennsylvania   Rhode Island
         36453          12219          12184          40589           3023
South Carolina   South Dakota      Tennessee          Texas           Utah
         15225           2046          20382          81264           9137
       Vermont       Virginia     Washington  West Virginia      Wisconsin
         25281           2002          20297          18255           6138
       Wyoming
          2069

# Cross table frequency for sample data
table(stratified_sample_10_percent$State, stratified_sample_10_percent$CardType)

                 American Express Discover MasterCard   Visa
  Alabama                     498      135        807    573
  American Samoa             4014     1060       6574   4609
  Arizona                    2501      647       4111   2915
  Arkansas                   5016     1298       8204   5760
  California                30118     7815      49119  34554
  Colorado                   4233     1119       6931   4894
  Connecticut                3026      794       4926   3434
  Delaware                    499      132        843    586


You can see that the proportions have remained the same. Next, we compare the
properties of the sample and the population. The summarise() function shows the average
outstanding balance by stratum. You can perform a pairwise t.test() to check that the
sampling has not altered the mean outstanding balance of each stratum. You
are encouraged to test the means with t.test(), as shown in the simple random
sampling section.
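
As a reminder of what such a check looks like for a single stratum, here is a minimal
sketch on simulated balances. The vectors `stratum_population` and `stratum_sample` are
hypothetical stand-ins for the outstanding balances of one CardType and State
combination in the population and in the sample.

```r
set.seed(937)
# Hypothetical balances for one stratum (simulated, not taken from the book's data)
stratum_population <- rnorm(20000, mean = 4000, sd = 1500)
stratum_sample     <- sample(stratum_population, 2000)   # 10% without replacement

# Welch two-sample t-test: the null hypothesis is that both groups have the same mean
tt <- t.test(stratum_population, stratum_sample)

tt$estimate   # the two means being compared
tt$p.value    # a large p-value gives no evidence that sampling shifted the mean
```

With many strata you would repeat this test per stratum, and it is then worth keeping in
mind that some small p-values will appear by chance alone.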


# Average outstanding balance by stratum variables
summary_population <- summarise(group_by(data, CardType, State), OutstandingBalance_Stratum = mean(OutsBal))

# Same summary for the stratified sample, to check that the sampling has
# preserved the stratum means
summary_sample <- summarise(group_by(stratified_sample_10_percent, CardType, State), OutstandingBalance_Sample = mean(OutsBal))

# Mean comparison by stratum
summary_mean_compare <- merge(summary_population, summary_sample,
                              by = c("CardType", "State"))


Again, we run a KS test to compare the distribution of the stratified sample with that of
the population. The test gives no evidence that the two distributions differ.


ks.test(data$OutsBal, stratified_sample_10_percent$OutsBal, alternative = "two.sided")

    Two-sample Kolmogorov-Smirnov test


data:  data$OutsBal and stratified_sample_10_percent$OutsBal
D = 0.00073844, p-value = 0.7045
alternative hypothesis: two-sided


Figure 3-12 shows histograms of the outstanding balance for the population and the
sample. The visual comparison suggests that the sample is representative of the
population.


par(mfrow = c(1, 2))

hist(data$OutsBal, breaks = 50, col = "red", xlab = "Outstanding Balance",
     main = "Population ")

hist(stratified_sample_10_percent$OutsBal, breaks = 50, col = "green",
     xlab = "Outstanding Balance",
     main = "Stratified Sample")


Figure 3-12. Population and stratified sample distribution


The distribution plot in Figure 3-12 reinforces the test results: the population
and the stratified random sample have the same distribution. The stratified random sample is
representative of the population.

Key points:


• Stratified sampling should be used when you want to make sure
the proportion of data points remains the same in the sample.
This not only ensures representativeness but also ensures that
every stratum is represented in the sample.


• Stratified sampling can also help you systematically design the
proportion of records drawn from each stratum, so you can design a
stratified sampling plan that changes the representation as per
business need. For instance, suppose you are modeling a binomial
response and the event rate (the proportion of 1s) is
very small in the dataset. You can then do stratified random
sampling from the two strata (0 or 1 response) and sample so that the
proportion of 1s increases to facilitate modeling.


3.7.5 Cluster Sampling


Many times, populations contain heterogeneous groups that are statistically evident in 
the population. In those cases, it is important to first identify the heterogeneous groups 
and then plan the sampling strategy. This technique is popular among marketing and 
campaign designers, as they deal with characteristics of heterogeneous groups within a 
population. 

Cluster sampling can be done in two ways: 


• Single-stage sampling: All of the elements within the selected clusters
are included in the sample. For example, you want to study a
particular population feature that’s dominant in a cluster, so you
might first identify the cluster and its elements and just
take all the units of that cluster.


• Two-stage sampling: A subset of elements within the selected clusters
is randomly selected for inclusion in the sample. This method is
similar to stratified sampling, but differs in that the parent units
here are clusters, whereas in stratified sampling they are strata. Strata
may themselves be divided into multiple clusters on the
measurement scale.
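
The difference between the two designs is easy to see on a toy population. The following
sketch assumes 20 hypothetical clusters (think of branches) with 100 customers each; the
numbers of clusters and units drawn are arbitrary choices for illustration.

```r
set.seed(937)
# Hypothetical population: 20 clusters (e.g., branches) of 100 customers each
pop <- data.frame(
  cluster = rep(1:20, each = 100),
  balance = round(rnorm(2000, mean = 4000, sd = 1500))
)

# Stage 1 (both designs): randomly select 4 of the 20 clusters
chosen <- sample(unique(pop$cluster), 4)

# Single-stage: keep every unit of the selected clusters
one_stage <- pop[pop$cluster %in% chosen, ]

# Two-stage: within each selected cluster, draw 25 units at random
two_stage_rows <- unlist(lapply(chosen, function(cl) {
  rows <- which(pop$cluster == cl)
  rows[sample.int(length(rows), 25)]
}))
two_stage <- pop[two_stage_rows, ]

nrow(one_stage)   # 4 clusters x 100 units = 400
nrow(two_stage)   # 4 clusters x 25 units  = 100
```

In both designs the 16 clusters that were not selected contribute nothing to the sample,
which is the key contrast with stratified sampling.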


For a fixed sample size, cluster sampling gives better results when most of the
variation in the population is within the groups, not between them. Choosing a
sampling method is not always straightforward. Often the cost per sample point
is lower for cluster sampling than for other sampling methods. Under such cost
constraints, cluster sampling might be a good choice.

It is important to point out the difference between strata and clusters. Although both
are non-overlapping subsets of the population, they differ in many respects.


• All strata are represented in the sample, whereas in cluster sampling only a
subset of the clusters is in the sample.


• Stratified sampling gives the best results when units within strata are
internally homogeneous. With cluster sampling, however, the
best results occur when elements within clusters are internally
heterogeneous.


Advantages:


• Cheaper than other methods for data collection, as the cluster of
interest requires less cost to collect and store, and requires less
administrative cost.


• Clustering takes a large population into account in terms of
cluster chunks. Since these groups/clusters are so large, deploying
any other technique would be very difficult. Clustering is feasible
only when we are dealing with large populations with statistically
significant clusters present in them.


• Reduction in variability of estimates is observed with other methods
of sampling, but this may not be an ideal situation every time.


Disadvantages:


• Sampling error is high due to the design of the sampling process.
The ratio between the number of subjects in the cluster study and
the number of subjects in an equally reliable, randomly sampled
un-clustered study is called the design effect; it measures how much
the clustered design inflates the sampling error.


• Sampling bias: The chosen clusters are taken as representative
of the entire population; if a chosen cluster
has a biased opinion, the entire population is inferred to have
the same opinion. This may not be the actual case.
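
The design effect can also be written down explicitly. The following is a sketch using a
standard approximation that is not part of the discussion above: it assumes equal-sized
clusters of $m$ units and an intraclass correlation $\rho$ between units of the same
cluster (both symbols are introduced here for illustration only).

```latex
\mathrm{DEFF}
  = \frac{\operatorname{Var}_{\text{cluster}}(\hat{\theta})}
         {\operatorname{Var}_{\text{SRS}}(\hat{\theta})}
  \approx 1 + (m - 1)\,\rho
% Example: m = 20 units per cluster and \rho = 0.05 give
% DEFF = 1 + 19 \times 0.05 = 1.95
```

Under these assumptions, a cluster sample would need roughly 1.95 times as many subjects
as a simple random sample to reach the same precision.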


Before we show you cluster sampling, let’s artificially create clusters in our data 
by subsetting the data by international transaction. We will subset the data with a 
conditional statement on international transaction. Here you can see we are artificially 
creating five clusters. 


# Before explaining cluster sampling, let's subset the data so that
# we have clear groups to illustrate the importance of cluster sampling
# Subset the data into 5 subgroups

Data_Subset_Clusters_1 <- subset(data, IntTransc > 2 & IntTransc < 5)

Data_Subset_Clusters_2 <- subset(data, IntTransc > 10 & IntTransc < 13)

Data_Subset_Clusters_3 <- subset(data, IntTransc > 18 & IntTransc < 21)

Data_Subset_Clusters_4 <- subset(data, IntTransc > 26 & IntTransc < 29)

Data_Subset_Clusters_5 <- subset(data, IntTransc > 34)


Data_Subset_Clusters <- rbind(Data_Subset_Clusters_1, Data_Subset_Clusters_2, Data_Subset_Clusters_3,
                              Data_Subset_Clusters_4, Data_Subset_Clusters_5)


str(Data_Subset_Clusters)

Classes 'data.table' and 'data.frame': 1291631 obs. of 14 variables:
 $ creditLine : int 1 1 1 1 1 1 1 1 1 1 ...
 $ gender     : int 1 1 1 1 1 1 1 1 1 1 ...
 $ state      : int 1 1 1 1 1 1 1 1 1 1 ...
 $ CustomerID : int 136032 726293 1916600 2180307 3186929 3349887 3726743 5121051 7595816 8058527 ...
 $ NumOfCards : int 1 1 1 1 1 1 1 1 2 1 ...
 $ OutsBal    : int 2000 2000 2000 2000 2000 2000 0 0 2000 2000 ...
 $ DomesTransc: int 785 5 44 43 51 57 23515 ...
 $ IntTransc  : int 3 4 3 3 4 4 3 3 4 3 ...
 $ FraudFlag  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ State      : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ PostalCode : chr "AL" "AL" "AL" "AL" ...
 $ Gender     : chr "Male" "Male" "Male" "Male" ...
 $ CardType   : chr "American Express" "American Express" "American Express" "American Express" ...
 $ CardName   : chr "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" "SimplyCash® Business Card from American Express" ...
 - attr(*, ".internal.selfref")=<externalptr>


We explicitly created clusters based on the number of international transactions. The
clusters are created to illustrate cluster sampling.

One-stage cluster sampling means randomly choosing clusters out of the five clusters
for analysis, while two-stage sampling means randomly choosing a few clusters and
then doing stratified random sampling from them. For Figure 3-13, we will first create
clusters using k-means (discussed in detail in Chapter 6) and then apply stratified
sampling, treating the cluster as the stratum variable.

The kmeans function creates clusters using the centroid-based k-means
clustering method. Since we have explicitly created five clusters, we call kmeans to
form five clusters based on the international transaction values. We already know that the
function will return exactly five clusters, because we ask for five; the centers it finds,
however, need not coincide with the buckets we created in the previous step. This has been
done only for illustration purposes; in real situations, you have to find out the clusters
present in the population data.
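
Before running k-means on the full data, it can help to see the function at work on a
tiny example. This sketch uses simulated one-dimensional values with three well-separated
groups; the group means of 3, 20, and 40 are arbitrary values chosen for illustration.

```r
set.seed(937)
# Hypothetical 1-D data: three well-separated groups of transaction counts
x <- c(rnorm(300, mean = 3, sd = 1),
       rnorm(300, mean = 20, sd = 2),
       rnorm(300, mean = 40, sd = 3))

# nstart = 25 reruns k-means from 25 random starts and keeps the best solution
km <- kmeans(x, centers = 3, nstart = 25)

sort(km$centers[, 1])   # estimated centers, close to 3, 20, and 40
table(km$cluster)       # number of points assigned to each cluster
```

The cluster labels (1, 2, 3) are arbitrary and can change from run to run, so it is safer
to identify a cluster by its center than by its number.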


# Now we will treat Data_Subset_Clusters as our population
library(stats)

kmeans_clusters <- kmeans(Data_Subset_Clusters$IntTransc, 5, nstart = 25)

cat("The cluster center are ", kmeans_clusters$centers)
The cluster center are 59.11837 22.02696 38.53069 47.98671 5.288112


Next, we take a random sample of records just to plot them neatly, as plotting with a 
large number of records will not be clear. 


set.seed(937)

# For plotting, use only 100000 records randomly chosen from the total data.
library(splitstackshape)
# Keep the sampled row indices so that each point is colored by its own cluster
plot_rows <- sample(nrow(Data_Subset_Clusters), size = 100000, replace = TRUE, prob = NULL)
PlotSample <- Data_Subset_Clusters[plot_rows, ]

plot(PlotSample$IntTransc, col = kmeans_clusters$cluster[plot_rows])
points(kmeans_clusters$centers, col = 1:5, pch = 8)


Figure 3-13. Input data segmented by the set of five classes by number of international
transactions


cluster_sample_combined <- cbind(Data_Subset_Clusters, kmeans_clusters$cluster)
setnames(cluster_sample_combined, "V2", "ClusterIdentifier")


Now, we show you the number of records summarized by each cluster. Take note of 
these numbers, as we will show you the two-stage cluster sampling. The sample will have 
the same proportion across the clusters. 


print("Summary of no. of records per clusters") 
[1] "Summary of no. of records per clusters" 
table(cluster_sample_combined$ClusterIdentifier)

     1      2      3      4      5
 67871 128219  75877  44771 974893


Treating the cluster identifier as the stratum variable, we use the stratified()
function to draw a sample containing 10% of each stratum’s population.


set.seed(937)

library(splitstackshape)

cluster_sample_10_percent <- stratified(cluster_sample_combined, group = c("ClusterIdentifier"), size = 0.1, replace = FALSE)


This step has created the two-stage cluster sample, i.e., randomly selected 10% of the 
records from each cluster. Let’s plot the clusters with the cluster centers. 


print("Plotting the clusters for random sample from clusters") 

[1] "Plotting the clusters for random sample from clusters" 
# Color each sampled point by the cluster it was assigned to
plot(cluster_sample_10_percent$IntTransc, col = cluster_sample_10_percent$ClusterIdentifier)
points(kmeans_clusters$centers, col = 1:5, pch = 8)


Figure 3-14. Clusters formed by K-means (star sign represents centroid of cluster)


Next is the frequency distribution of the cluster sample. Compare it with the cluster
counts of the population used for clustering, shown earlier. The stratified sampling at
stage two of cluster sampling has ensured that the proportions of data points remain the
same, i.e., 10% of the stratum size.


print("Summary of no. of records per clusters") 
[1] "Summary of no. of records per clusters" 
table(cluster_sample_10_percent$ClusterIdentifier)

    1     2     3     4     5
 6787 12822  7588  4477 97489


Let’s now see how cluster sampling has impacted the outstanding balance by
comparing the population with the cluster sample.


population_summary <- summarise(group_by(data, CardType), OutstandingBalance_Population = mean(OutsBal))

Warning in gmean(OutsBal): Group 1 summed to more than type 'integer'
can hold so the result has been coerced to 'numeric' automatically, for
convenience.

cluster_summary <- summarise(group_by(cluster_sample_10_percent, CardType), OutstandingBalance_Cluster = mean(OutsBal))

summary_mean_compare <- merge(population_summary, cluster_summary, by = "CardType")

print(summary_mean_compare)
Source: local data table [4 x 3]

          CardType OutstandingBalance_Population
1 American Express                      3820.896
2         Discover                      4962.420
3       MasterCard                      3818.300
4             Visa                      4584.042

Variables not shown: OutstandingBalance_Cluster (dbl)


This summary shows how the mean of the outstanding balance is impacted by cluster
sampling based on international transactions. For visual inspection, we create
histograms in Figure 3-15. You will see that the distribution is impacted only marginally.
This could be because the clusters we created, assuming international transaction buckets,
were homogeneous and hence did not have a great impact on the outstanding balance.
To be sure, you are encouraged to do a t.test() to see whether the means differ
significantly.


par(mfrow = c(1, 2))
hist(data$OutsBal, breaks = 50, col = "red", xlab = "Outstanding Balance",
     main = "Histogram for Population Data")

hist(cluster_sample_10_percent$OutsBal, breaks = 50, col = "green",
     xlab = "Outstanding Balance",
     main = "Histogram for Cluster Sample Data ")


Figure 3-15. Cluster population and cluster random sample distribution


In other words, the two-stage cluster sampling shown here works like stratified sampling;
the only difference is that a stratum variable exists in the data and is an intrinsic
property of the data, while in cluster sampling we first identify the clusters and then do
random sampling from those clusters.

Key points:


• Cluster sampling should be done only when there is strong evidence
of clusters in the population and you have a strong business reason to
justify the clusters and their impact on the modeling outcome.


• Cluster sampling should not be confused with stratified sampling.
In stratified sampling, the strata are formed on the attributes
in the dataset, while clusters are created based on the similarity
of subjects in the population by some relation, e.g., distance from
a centroid, the same multivariate features, etc. Pay close attention
when implementing cluster sampling: the clusters need to
exist in the data, and there should be a business case for their homogeneity.


3.7.6 Bootstrap Sampling 


In statistics, bootstrapping is any sampling method, test, or measure that relies on
random sampling with replacement. Theoretically, you can create a population of infinite
size to sample from in bootstrapping. It is an advanced topic in statistics and is widely
used where you have to calculate sampling measures of statistics, e.g., mean,
variance, bias, etc., from a sample estimate of the same.

Bootstrapping allows estimation of the sampling distribution of almost any statistic
using random sampling methods. The jackknife predates the modern bootstrapping
technique. The jackknife estimator of a parameter is found by repeatedly leaving out one
observation and calculating the estimate. Once all the observation points are exhausted,
the average of the estimates is taken as the estimator. For a sample of size n, the
jackknife estimate can be found by aggregating the n estimates, each computed on n-1
observations. It is important to understand the jackknife approach, as it provides the
basic idea behind the bootstrapping method of sample metric estimation.

The jackknife estimate of a parameter can be found by estimating the parameter for
each subsample, omitting the ith observation, to estimate the previously unknown
value of a parameter (say $\bar{x}_i$). For a sample of size $n$:

$$\bar{x}_i = \frac{1}{n-1} \sum_{j \neq i} x_j$$

The jackknife technique can be used to estimate the variance of an estimator:

$$\operatorname{Var}_{\text{jackknife}} = \frac{n-1}{n} \sum_{i=1}^{n} \left( \bar{x}_i - \bar{x}_{(\cdot)} \right)^2$$

where $\bar{x}_i$ is the parameter estimate based on leaving out the ith observation, and
$\bar{x}_{(\cdot)}$ is the estimator based on all of the subsamples.

In 1979, B. Efron of Stanford University published his noted paper, “Bootstrap
Methods: Another Look at the Jackknife.” This paper provides the first detailed account
of bootstrapping for a variety of sample metric estimation problems. Statistically, the
paper tried to address the following problem: Given a random sample,
$X = (x_1, x_2, \ldots, x_n)$, from an unknown probability distribution $F$, estimate the
sampling distribution of some prespecified random variable $R(X, F)$, on the basis of the
observed data $x$. We leave it to you to explore the statistical detail of the method.
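
The jackknife described above takes only a few lines of R. The sketch below applies it to
the sample mean of a small simulated sample. The data and the names `theta_i`,
`theta_dot`, and `var_jack` are hypothetical, and the variance formula used is the
standard jackknife variance, (n - 1)/n times the sum of squared deviations of the
leave-one-out estimates from their average.

```r
set.seed(937)
x <- rnorm(30, mean = 50, sd = 10)   # hypothetical small sample
n <- length(x)

# Leave-one-out estimates of the mean: the i-th entry omits observation i
theta_i <- sapply(seq_len(n), function(i) mean(x[-i]))

# Jackknife estimate: the average of the leave-one-out estimates
theta_dot <- mean(theta_i)

# Jackknife variance: (n - 1) / n * sum((theta_i - theta_dot)^2)
var_jack <- (n - 1) / n * sum((theta_i - theta_dot)^2)

# For the sample mean this reproduces the classical variance var(x) / n
c(jackknife = var_jack, classical = var(x) / n)
```

The agreement with var(x)/n is specific to the mean. The jackknife becomes useful for
statistics that have no simple variance formula.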

When you don’t know the distribution of the population (or you don’t even have a
population), bootstrapping comes in handy for building hypothesis tests for the sampling
estimates. The bootstrapping technique samples data from the empirical distribution
obtained from the sample. Where a set of observations can be assumed to come
from an independent and identically distributed population, this can be implemented by
constructing a number of resamples, drawn with replacement from the observed dataset
(and of equal size to the observed dataset). This comes in very handy when we have a small
dataset and are unsure about the distribution of the estimator needed to perform
hypothesis testing.
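
The resampling idea can be sketched in base R without any additional package. The
example below bootstraps the median of a small simulated sample. The data, the number of
resamples B, and the choice of a percentile interval are all assumptions made for
illustration.

```r
set.seed(937)
x <- rexp(40, rate = 1 / 4000)   # hypothetical small, skewed sample of balances
B <- 2000                        # number of bootstrap resamples (an arbitrary choice)

# Each resample has the same size as x and is drawn with replacement
boot_medians <- replicate(B, median(sample(x, length(x), replace = TRUE)))

sd(boot_medians)                          # bootstrap standard error of the median
quantile(boot_medians, c(0.025, 0.975))   # percentile 95% confidence interval
```

The median is a convenient illustration because its sampling distribution has no simple
closed form for a small, skewed sample, which is the situation where bootstrapping pays off.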

Advantages:


• Simple to implement; it provides an easy way to calculate
standard errors and confidence intervals for complex unknown
sampling distributions.


• With increasing computing power, the bootstrap results get better.


• One popular application of bootstrapping is to check for stability
of estimates.


Disadvantages:


• Bootstrapping is asymptotically consistent, but does not provide
finite sample consistency.


• This is an advanced technique, so you need to be fully aware of
the assumptions and properties of estimates derived from the
bootstrap methods.


In our R example, we will show how bootstrapping can be used to estimate a
population parameter and to create a confidence interval around that estimate. This helps in
checking the stability of the parameter estimate and in performing a hypothesis test. We will
build the example on a business-relevant linear regression model.


Note Bootstrapping techniques are most relevant to estimation problems when you
have a very small sample and it is difficult to find the distribution of the actual population.


First, we fit a linear regression model on the population data (without an intercept). The
model is fit with the outstanding balance (OutsBal) as the response variable and the
number of domestic transactions (DomesTransc) as the predictor. Business intuition says
that the outstanding balance should be positively correlated with the number of domestic
transactions. A positive correlation between the dependent and independent variables
implies that the sign of the linear regression coefficient should be positive.

The coefficient that we get is the true value of the parameter, because it is calculated
on the full population.


set.seed(937)
library(boot)
# Now we need the function we would like to estimate

# First fit a linear model and know the true estimates, from population data
summary(lm(OutsBal ~ 0 + DomesTransc, data = data))

Call:
lm(formula = OutsBal ~ 0 + DomesTransc, data = data)

Residuals:
   Min     1Q Median     3Q    Max
 -7713  -1080   1449   4430  39091

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
DomesTransc 77.13469    0.03919    1968   <2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4867 on 9999999 degrees of freedom
Multiple R-squared: 0.2792, Adjusted R-squared: 0.2792
F-statistic: 3.874e+06 on 1 and 9999999 DF, p-value: < 2.2e-16


You can see the summary of the linear regression model fit on the population
data. Now we will take a small sample of the population (sampling fraction =
10,000/10,000,000 = 1/1000). Hence, our challenge is to estimate the coefficient of domestic
transactions from a very small dataset.

In this context, bootstrapping can be seen as a process that creates a larger set of samples
from a small set of values to get an estimate of the true distribution of the estimate.


set.seed(937)

# Assume that we have only 10000 data points and have to do the
# hypothesis test around the significance of the coefficient of domestic transactions. As
# the dataset is small, we will use bootstrapping to create the distribution
# of the coefficient and then create a confidence interval to test the hypothesis.

sample_10000 <- data[sample(nrow(data), size = 10000, replace = FALSE, prob = NULL), ]


Now we have a small sample to work with. Let’s define a function named Coeff,
which will return the coefficient of the domestic transaction variable.
It has three arguments:

• data: This will be the small dataset that you want to bootstrap. In
  our case, this is the sample dataset of 10000 records.

• b: A random vector of indexes chosen each time the function is
  called. This ensures that each call fits the model on a dataset
  that's randomly drawn from the input data.

• formula: This is an optional field that would hold the functional
  form of the model to be estimated by the linear regression.

Here we have simply written the formula directly into the return statement, so the
formula argument is not used.


# Function to return Coefficient of DomesTransc
Coeff = function(data, b, formula) {
  # b is the random indexes for the bootstrap sample
  d = data[b, ]
  # that's for the beta coefficient
  return(lm(OutsBal ~ 0 + DomesTransc, data = d)$coefficients[1])
}


Now we can start the bootstrap, using the function boot() from the
boot library. It is a very powerful function for both parametric and non-parametric
bootstrapping. We consider this an advanced topic and will not cover the details of this
function. Interested readers are advised to read the function documentation on CRAN.

The inputs we are using for our example are:

• data: This is the small sample data we created in the previous step.

• statistic: This is a function that will return the estimated
  value of the parameter of interest. Here our function Coeff will
  return the value of the coefficient of the domestic transactions.

• R: This is the number of bootstrap samples you want to create. As a
  general rule of thumb for the approach used here, the more bootstrap samples
  you have, the narrower the confidence band around the mean of the bootstrap estimates.
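The three inputs can be tried out on a small simulated dataset before running them on real data. The sketch below is an illustration only: the data frame toy, its values, and the helper slope_stat are assumptions made here, with column names chosen to mirror the text.

```r
library(boot)  # ships with standard R installations as a recommended package

set.seed(937)
# Simulated stand-in for the small sample; the values are made up for illustration
toy <- data.frame(DomesTransc = runif(500, min = 0, max = 100))
toy$OutsBal <- 75 * toy$DomesTransc + rnorm(500, sd = 500)

# statistic: receives the data and a vector of resampled row indexes
slope_stat <- function(data, idx) {
  d <- data[idx, ]
  coef(lm(OutsBal ~ 0 + DomesTransc, data = d))[1]
}

toy_boot <- boot(data = toy, statistic = slope_stat, R = 200)

toy_boot$t0        # estimate on the original (unresampled) toy data
dim(toy_boot$t)    # one row per bootstrap replicate, one column per statistic
```

The component t0 holds the statistic computed on the data as given, while t holds the R bootstrap replicates that the later plots and summaries are based on.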


Note For this example, we use a small number of bootstrap samples so that the
confidence band is broad, and then see with what confidence we can recover the original
estimate from the population.


Here we call the function with R=50.

set.seed(937)

# R is how many bootstrap samples
bootbet = boot(data = sample_10000, statistic = Coeff, R = 50)

names(bootbet)
 [1] "t0"        "t"         "R"         "data"      "seed"
 [6] "statistic" "sim"       "call"      "stype"     "strata"
[11] "weights"

Now plot the histogram and Q-Q plot for the estimated values of the coefficient.


plot(bootbet)

Figure 3-16. Histogram and QQ plot of estimated coefficient


Now, we plot the histogram of the parameter estimate. We can see that the bootstrap
samples uncovered the distribution of the parameter. We can form a confidence interval
around this and do hypothesis testing.


hist(bootbet$t, breaks = 5)

Figure 3-17. Histogram of parameter estimate from bootstrap (plot title: Histogram of bootbet$t)


Here we calculate the mean and variance of the estimated values from
bootstrapping. Assuming the coefficient is normally distributed, you can
create a confidence band around the mean for the true value.


mean(bootbet$t)
[1] 76.77636
var(bootbet$t)
         [,1]
[1,] 2.308969
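A confidence band based on the normal assumption can be worked out from these two numbers. The sketch below simply plugs in the mean and variance reported above and the population coefficient from the earlier regression output; the 95% level is an assumption made for illustration. The boot package also provides boot.ci() for other interval types, which is not used here.

```r
# Values reported in the text for the 50 bootstrap estimates
boot_mean <- 76.77636
boot_var  <- 2.308969

boot_sd <- sqrt(boot_var)       # spread of the bootstrap estimates
z <- qnorm(0.975)               # two-sided 95% normal quantile

# Normal-approximation band for the coefficient itself
band <- c(lower = boot_mean - z * boot_sd, upper = boot_mean + z * boot_sd)
band

# Does the population coefficient reported earlier fall inside the band?
true_coef <- 77.13469
inside <- true_coef >= band[["lower"]] && true_coef <= band[["upper"]]
inside
```

Note that this band describes the spread of individual bootstrap estimates, so it is wider than an interval for the mean of the 50 replicates.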


Additionally, to show how the distribution looks superimposed on a normal
distribution with the previous parameters, do this:


x <- bootbet$t
h <- hist(x, breaks = 5, col = "red", xlab = "Boot Strap Estimates",
          main = "Histogram with Normal Curve")
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = mean(bootbet$t), sd = sqrt(var(bootbet$t)))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)


In Figure 3-18 you can see that we have been able to find the distribution of the
coefficient and hence can do hypothesis testing on it. This also gave us a close
estimate of the true coefficient. If you look closely, this idea is very close to what the
jackknife originally proposed. With more computing power, we have just expanded the scope of
that method from the mean and standard deviation to any parameter estimation.

Figure 3-18. Histogram with normal density function (plot title: Histogram with Normal Curve)


The following code runs t.test() on the bootstrap values of the coefficient against the
true estimate of the coefficient from the population data. This will tell us how close we
got to the estimate from a smaller sample and with what confidence we would be able to
accept or reject the bootstrapped coefficient.


t.test(bootbet$t, mu = 77.13)

	One Sample t-test

data:  bootbet$t
t = -1.6456, df = 49, p-value = 0.1062
alternative hypothesis: true mean is not equal to 77.13
95 percent confidence interval:
 76.34452 77.20821
sample estimates:
mean of x
 76.77636


Key points:

• Bootstrapping is a powerful technique that comes in handy when we
  have little knowledge of the distribution of the parameter and only a
  small dataset is available.

• This technique is advanced in nature and involves a lot of
  assumptions, so proper statistical knowledge is required to use
  bootstrapping techniques.




3.8 Monte Carlo Method: Acceptance-Rejection Method


In modern times, Monte Carlo methods have become a separate field of study in
statistics. Monte Carlo methods leverage computationally heavy random sampling
techniques to estimate the underlying parameters. These techniques are important for
stochastic equations where an exact solution is not possible. Monte Carlo techniques
are very popular in the financial world, specifically in financial instrument valuation and
forecasting.

In statistics, acceptance-rejection methods are very basic techniques to sample
observations from a distribution. In this method, random sampling is done from a
distribution and, based on preset conditions, each observation is accepted or rejected;
hence it falls in the broad bucket of Monte Carlo methods.

In this method, we first estimate the empirical distribution of the dataset (or empirical
density function: EDF) by looking at the cumulative probability distribution. After we get the
EDF, we set the parameters for another known distribution. The known distribution will
cover the EDF.

Now we start sampling from the known distribution and accept an observation if
it lies within the EDF; otherwise, we reject it. In other words, rejection sampling can be
performed by following these steps:


1. Sample a point from the proposed distribution (say x).


2. Draw a vertical line at this sample point x up to the maximum of the
   density curve (Figure 3-19).


3. Sample uniformly along this line from 0 to max(PDF), where PDF
   stands for probability density function. If the sampled value is
   greater than the value of the target density at x, reject x; otherwise accept it.


This method helps us draw a sample from any distribution using a known distribution.
These methods are very popular in stochastic calculus for financial product valuation and
other stochastic processes.
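The three steps can be wrapped into a small reusable function. This is a minimal sketch under stated assumptions: the proposal is uniform on a bounded interval, the caller supplies an upper bound on the target density, and the helper name rejection_sample and the triangular test density are made up for illustration.

```r
# Hypothetical helper: rejection sampling on [lower, upper] with a uniform proposal.
# target_pdf must be bounded above by max_pdf on that interval.
rejection_sample <- function(n, target_pdf, max_pdf, lower = 0, upper = 1) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, lower, upper)            # step 1: points from the proposal
    y <- runif(n, 0, max_pdf)              # steps 2-3: uniform height along the vertical line
    out <- c(out, x[y <= target_pdf(x)])   # keep x only when the height falls under the target curve
  }
  out[seq_len(n)]
}

set.seed(937)
# Triangular density f(x) = 2x on [0, 1]; its maximum is 2 and its mean is 2/3
draws <- rejection_sample(2000, function(x) 2 * x, max_pdf = 2)
mean(draws)
```

Comparing the sample mean with the known mean of the target density is a quick way to check that the accepted points follow the intended distribution.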

To illustrate this method, we will draw a sample from a beta distribution with
parameters (3,10). The beta distribution looks like Figure 3-19.


curve(dbeta(x, 3, 10), 0, 1)

Figure 3-19. Beta distribution plot


We first create a sample of 5000 random values between 0 and 1. Now we
calculate the beta density corresponding to each of the 5000 random values.


set.seed(937)
sampled <- data.frame(proposal = runif(5000, 0, 1))
sampled$targetDensity <- dbeta(sampled$proposal, 3, 10)


Now, we calculate the maximum probability density for our target distribution
(beta PDF). Once we have the maximum density and the sample density for the 5000 cases, we
start our sampling by rejection as follows. Create a random number between 0 and 1:


# Reject the value as coming from the beta distribution if the random number is more than
# the sample density we calculated for the beta distribution, scaled by the maximum density
maxDens = max(sampled$targetDensity, na.rm = T)
sampled$accepted = ifelse(runif(5000, 0, 1) < sampled$targetDensity / maxDens,
                          TRUE, FALSE)
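A useful sanity check on this step is the acceptance rate. With a uniform proposal on [0, 1], the share of accepted points should be close to the area under the density (which is 1) divided by the area of the bounding box (which equals the maximum density). The sketch below repeats the sampling in a self-contained way; as an assumption made here, it uses the exact peak of the Beta(3, 10) density at its mode rather than the maximum over the sampled points.

```r
set.seed(937)
n <- 5000
proposal <- runif(n, 0, 1)
target_density <- dbeta(proposal, 3, 10)

# Exact peak of the Beta(3, 10) density: its mode is (3 - 1) / (3 + 10 - 2)
max_dens <- dbeta(2 / 11, 3, 10)

accepted <- runif(n, 0, 1) < target_density / max_dens

observed_rate <- mean(accepted)   # share of proposals that were kept
expected_rate <- 1 / max_dens     # area under the density over area of the bounding box
c(observed = observed_rate, expected = expected_rate)
```

A low acceptance rate means most proposals are discarded, which is the main cost of using a proposal that covers the target loosely.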


Figure 3-20 shows a plot of the PDF of beta(3,10) and the histogram of the accepted
sample. We can see that we have been able to create the desired sample by accepting the
random numbers that lie below the red line, i.e., the PDF of the beta distribution.


hist(sampled$proposal[sampled$accepted], freq = F, col = "grey", breaks = 100)
curve(dbeta(x, 3, 10), 0, 1, add = T, col = "red")

Figure 3-20. Sampling by rejection (plot title: Histogram of Accepted Sample)


3.9 A Qualitative Account of Computational Savings by Sampling


This section shows a small example to help you understand how sampling also helps
reduce computational costs. To show this, we will first fit a linear regression model
using the population dataset and then fit the same model on a smaller sample.

We know from our discussion of sampling that if sampling is done properly, we can
estimate the population parameters with very high confidence. For illustration purposes,
we will show linear regression fitting on a population that has 10 million records and a
sample of size 10000.

Next, we call the function Sys.time(), which returns the current time of the system.
Using this function, we will measure the computation time of the model fit for the
population and the sample.

First, we fit a linear regression model with the total population data.


# estimate parameters
library(MASS)
start.time <- Sys.time()

population_lm <- lm(OutsBal ~ DomesTransc + Gender, data = data)

end.time <- Sys.time()
time.taken_1 <- end.time - start.time

cat("Time taken to fit linear model on Population", time.taken_1)
Time taken to fit linear model on Population 18.74036


Now, let’s fit the same model on a random sample of 10000 values.

start.time <- Sys.time()
sample_lm <- lm(OutsBal ~ DomesTransc + Gender, data = sample_10000)

end.time <- Sys.time()
time.taken_2 <- end.time - start.time

cat("Time taken to fit linear model on Sample ", time.taken_2)
Time taken to fit linear model on Sample 0.01551104


We can see how different the times were for the two computations. (Note: The
times shown are based on the computing power of the author's machine; you may get
different times based on your system configuration.)

Essentially, the operation on the population took a very long time. Estimation with the
population data took 1000+ times longer than the same estimation on the sample.
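The same comparison can be reproduced on simulated data with base R's system.time(), which wraps the start and end timestamps into a single call. This is a sketch only: the data frames pop and smp, their sizes, and the simulated coefficient are assumptions chosen so that the example runs quickly, and the timings will differ from machine to machine.

```r
set.seed(937)
# Simulated stand-in data; sizes are kept modest so the sketch runs quickly
n_pop <- 200000
pop <- data.frame(DomesTransc = runif(n_pop, 0, 100))
pop$OutsBal <- 75 * pop$DomesTransc + rnorm(n_pop, sd = 500)
smp <- pop[sample(nrow(pop), size = 2000, replace = FALSE), ]

# system.time() evaluates the expression and reports user, system, and elapsed seconds
t_pop <- system.time(fit_pop <- lm(OutsBal ~ DomesTransc, data = pop))
t_smp <- system.time(fit_smp <- lm(OutsBal ~ DomesTransc, data = smp))

c(population = t_pop[["elapsed"]], sample = t_smp[["elapsed"]])

# The two slope estimates should be close even though the sample is 1% of the data
slopes <- c(population = unname(coef(fit_pop)[2]), sample = unname(coef(fit_smp)[2]))
slopes
```

Looking at the fitted slopes next to the timings shows both sides of the trade-off: how much time the sample saves and how little the estimate moves.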


3.10 Summary 


In this chapter we covered different sampling techniques and showed how these
sampling techniques reduce the volume of data to process and at the same time retain
the properties of the data. The best sampling method to apply on any population is simple
random sampling without replacement.

We also discussed bootstrap sampling, which is an important concept because you
can estimate the distribution of any parameter by this method. At the end, we showed an
illustration of sampling by rejection, which allows us to create any distribution from
known distributions. This technique is based on Monte Carlo simulation and is very
popular in financial services.

This chapter plays an important role in reducing the volume of data to apply in our
machine learning algorithms while keeping the population variance intact.

In the next chapter, we will look at the properties of data with visualization. If we use
appropriate sampling, the same visualizations and trends will appear from the sample as
they do from the population.


Data Visualization in R





Data visualization is the process of creating and studying the visual representation of
data to bring out meaningful insights. Michael Friendly’s 2009 paper titled “Milestones
in the history of thematic cartography, statistical graphics, and data visualization,”
provides the following overarching definition of information visualization:


Information visualization is the broadest term that could be taken 
to subsume all the developments described here. At this level, almost 
anything, if sufficiently organized, is information of a sort. Tables, 
graphs, maps, and even text, whether static or dynamic, provide some 
means to see what lies within, determine the answer to a question, find 
relations, and perhaps apprehend things which could not be seen so 
readily in other forms. 


This comprehensive definition should make you aware of what a large scope
visualization covers. The broader field of information visualization is also called
infographics, where information might be stored in formats other than data. Our focus
in this chapter is only one specific type of visualization, which is commonly called data
visualization. Data visualization specifically deals with visualizing the information in a
given dataset. This can include multiple types of charts, graphs, colors, line plots, etc. Data
visualization is an effective way to present data because it shifts the balance between
perception and cognition to take fuller advantage of the brain's abilities. The way
we encode the information is very important for creating direct pathways into the brain's
cognition. The core tools used to encode information in a visualization are color, size,
shape, numbers, and other properties.

Data visualization has brought about a lot of benefits for industry and academia.
Data visualization has led the wave of the analytics world for quite some years and is
expected to lead the curve for the next decade. This phenomenal growth has been possible
because visualization is very useful for understanding the massive data that we are
gathering in industry and academic research. The first step in data science is to understand
the data; only then do you start thinking about models and algorithms. There are many
benefits to embracing data visualization as an integral part of the data science process.
Some of the direct benefits of data visualization are:

• Identifying red spots in data, starting diagnostics

• Tracking and identifying relations among different attributes

• Seeing the trend and fallouts to understand the reasons

• Summarizing complicated long spreadsheets and databases into
  visual art

• Providing an easy-to-use and very impactful way to store and present
  information, among others


The market has many paid visualization software suites and on-demand cloud
applications that can create meaningful visuals at the click of a button. However, we will
explore the power of open source packages and tools for creating visualizations in R.

Any kind of data visualization fundamentally depends on four key elements of
data presentation, namely Comparison, Relationship, Distribution, and Composition.
Comparison is used to see the differences between multiple items at a given point in time
or to see the relative change in a variable over a time period. A relationship element
helps in finding the correlation between two or more variables as their values increase or
decrease. Scatter and bubble charts are some examples in this category. Distribution
charts like column and line histograms show the spread of data. For instance, data with
skewness toward the left or right can be easily spotted. Composition refers to a
chart with multiple components, like a pie chart or a stacked column/area chart. In our
PEBE ML process flow, visualization plays a key role in the exploration phase.

Visualization serves as an aid in storytelling by harnessing the power of data. There
are plenty of examples of patterns emerging from some simple plots that are otherwise
difficult to find even with sophisticated statistics. Throughout this chapter, we will
explore the four elements of data presentation with suitable examples and highlight how
important a role visualization plays in understanding the data down to its finest
detail. Although we have dedicated this chapter to data visualization, the
approaches taught here are used in every other chapter of this book.


4.1 Introduction to the ggplot2 Package 


R developers have created a good collection of visualization libraries. Being open
source, these packages get updated very rapidly with new features. Another remarkable
development in R tools for visualization is that the developers have been able to create
functions that can replicate some of the computationally demanding 3D plots and model
outputs. The most important of all the packages available in R for visualization is ggplot2.
ggplot2 is a data visualization package created by Hadley Wickham in 2005. It’s an
implementation of Leland Wilkinson's Grammar of Graphics, a general scheme for data
visualization that breaks up graphs into semantic components such as scales and layers.
It is also important to state here that the other powerful plotting function that we have
used multiple times is plot(). Both plot() and ggplot2 are used extensively in the book.

Before we go deeper into this chapter, this section includes a quick guide with some basic
understanding of ggplot and its various layers, which you will see being used throughout
this chapter (for a detailed study of ggplot, we recommend “ggplot2: Elegant Graphics for
Data Analysis,” by Hadley Wickham [1]). The following descriptions are taken from the R
documentation:


• ggplot(): Initializes a ggplot object. It can be used to declare
  the input data frame for a graphic and to specify the set of plot
  aesthetics intended to be common throughout all subsequent
  layers unless specifically overridden.

• aes(): Generates aesthetic mappings that describe how variables
  in the data are mapped to visual properties (aesthetics) of geoms.
  This function also standardizes aesthetic names by performing
  partial name matching, converting color to colour, and old style R
  names to ggplot names (for example, pch to shape, cex to size).

• geom_point(): The point geom is used to create scatterplots.

• geom_line(): Connects the observations in order of the variable on
  the x-axis.

• scale_x_log10(): One of the transformation functions that prove very
  useful while setting the scale of the plots and charts.

• scale_size_continuous(): Scales the area. The size aesthetic is
  most commonly used for points and text, and humans perceive
  the area of points, so this provides for optimal perception.

• facet_wrap(): Most displays are roughly rectangular, so if you
  have a categorical variable with many levels, it doesn't make
  sense to try to display them all in one row (or one column). To
  solve this dilemma, facet_wrap() wraps a 1D sequence of panels
  into 2D, making best use of screen real estate.

• scale_fill_manual(): Creates your own discrete scale for the fill
  aesthetic; sibling functions cover color, size, shape, etc.

• xlab(): Changes x-axis labels.

• ylab(): Changes y-axis labels.

• ggtitle(): Sets the plot title.

• theme(): Use this function to modify theme settings. This function
  comes with a very rich set of parameters that provides for creating
  elegant looking graphics. Detailed ggplot2 documentation
  can be accessed at
  https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf.
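Several of the functions listed above can be combined in one short plot. The following is a minimal sketch, not an example from the book: it uses the mtcars dataset that ships with R as a stand-in, and the axis labels and title are made up for illustration.

```r
library(ggplot2)

# mtcars ships with R; it is used here only as a stand-in dataset
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +                         # scatterplot layer
  facet_wrap(~ gear) +                           # one panel per number of gears
  xlab("Weight (1000 lbs)") +
  ylab("Miles per gallon") +
  ggtitle("Fuel efficiency by weight") +
  theme(plot.title = element_text(face = "bold", size = 14))

print(p)
```

Each call joined with + adds one component to the plot object, so layers, facets, labels, and theme settings can be added or removed independently.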


There are some other packages that we use in this chapter and would like readers to
explore further. Some of them are googleVis, ggmap, ggrepel, waterfall,
and rCharts. These are all highly recommended.


4.2 World Development Indicators


A good data visualization tells a story with numbers. Economics is one of the fields that
has integrated well with the visualization world. Visualization in economics has a long
history. Playfair’s 1801 pie-circle-line chart, comparing population and taxes in several
nations, is proof of how old the relationship between economics and visualization
is. Michael Friendly provided a comprehensive history and early examples of data
visualization in his paper, mentioned in the previous section.

In this chapter we will discuss chart types with some examples. Half of the
chapter uses economic indicators to build visualizations. The specific plots and
graphs will be discussed with specific examples. The World Bank collects data to monitor
economic indicators across the world. For details of the data and economic principles,
visit http://www.worldbank.org/.

The following section is a quick introduction to the core indicators. A suitable
visualization for understanding each indicator's meaning and impact will be presented in
the following sections. There has been a lot of good research using the World Bank’s
data by social scientists in various sectors. We have cherry-picked a few really impactful
parts of that research and brought the real essence of the data into view. As we move from
one example to the next, emphasis will be given to choosing the right type of visualization
and extracting meaning out of the data without looking at the hundreds of rows and
columns of a CSV or Excel file. Many of these visualizations are also provided on the
World Bank web site; however, in this book, you will learn how to use the ggplot2
package available in R to produce different graphs, charts, and plots. Instead of following
a traditional approach of learning the grammar of graphics and then discussing a lot of
theory on visualization, we have chosen a theme (the World Bank’s development
indicators) and will take you on a journey by means of storytelling. On the way,
various types of visualization will be introduced.


4.3 Line Chart 


A line chart is a basic visualization chart type in which information is displayed as a
series of data points called markers connected by line segments. Line charts are used for
showing trends in multiple categories of a variable. For instance, Figure 4-1 shows the
growth of the Gross Domestic Product (GDP) over the years for the top 10 countries based
on their most recent reported GDP figures. It helps in visualizing the trend in GDP growth
for all these countries in a single plot.


library(reshape)
library(ggplot2)

GDP <- read.csv("Dataset/Total GDP 2015 Top 10.csv")
names(GDP) <- c("Country", "2010", "2011", "2012", "2013", "2014", "2015")


The following code uses a very important function that will be repeated in later
sections as well, called melt().

The melt() function takes data in wide format and stacks a set of columns into a
single column of data. You can think of it as melting the current dimensions and getting
simpler dimensions. melt() is available in the reshape package loaded above and in its
successor, reshape2. In the following code, after reshaping the dataset, you reduce it to
three columns. The year columns are stacked into a pair of columns: one holding the
original column name and one holding the value. The columns named in id are kept as
identifiers; all remaining columns are treated as measured variables and stacked.


GDP_Long_Format <- melt(GDP, id = "Country")
names(GDP_Long_Format) <- c("Country", "Year", "GDP_USD_Trillion")
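The effect of melt() is easiest to see on a tiny table. The sketch below uses made-up values for two hypothetical countries and assumes the reshape2 package is installed; the variable.name and value.name arguments name the stacked columns directly instead of renaming them afterwards.

```r
library(reshape2)  # assumed to be installed

# Made-up values for two hypothetical countries, in wide format (one column per year)
wide <- data.frame(Country = c("A", "B"),
                   `2010` = c(1.0, 2.0),
                   `2011` = c(1.5, 2.5),
                   `2012` = c(2.0, 3.0),
                   check.names = FALSE)

long <- melt(wide, id.vars = "Country",
             variable.name = "Year", value.name = "GDP_USD_Trillion")
long
dim(long)   # 6 rows (2 countries x 3 years) and 3 columns
```

Each row of the long table is one country-year pair, which is the shape ggplot expects when a single aesthetic such as x = Year has to range over what used to be separate columns.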


It is important to understand how the following code creates the plot
using ggplot. Let’s break it down once; the same concept applies to the rest of the plots:

• aes(): The aesthetics of the plot; it tells the ggplot() object the
  x and y values, the grouping, and other options.

• geom_line(): This adds a layer to the plot with a line type as defined
  in aes().

• geom_point(): This adds points as another layer of the plot. The features
  of the points and their properties are provided in aes(). For
  instance, in the following code, we want points on each line, with
  the color of the points being the same for each country and the
  size of each point being 5.

• theme(): This command has options to design the theme of the plot
  canvas.

• xlab(): Labels the x-axis.

• ylab(): Labels the y-axis.

• ggtitle(): Sets the title of the plot.

In this book, you might come across new options, so always make sure to consult the
ggplot2 manual for any specific need. The chances are good that you will be able to
create the kind of visualization you want.


ggplot(GDP_Long_Format, aes(x = Year, y = GDP_USD_Trillion, group = Country)) +
  geom_line(aes(colour = Country)) +
  geom_point(aes(colour = Country), size = 5) +
  theme(legend.title = element_text(family = "Times", size = 20),
        legend.text = element_text(family = "Times", face = "italic", size = 15),
        plot.title = element_text(family = "Times", face = "bold", size = 20),
        axis.title.x = element_text(family = "Times", face = "bold", size = 12),
        axis.title.y = element_text(family = "Times", face = "bold", size = 12)) +
  xlab("Year") +
  ylab("GDP (in trillion USD)") +
  ggtitle("Gross Domestic Product - Top 10 Countries")

Figure 4-1. A line chart showing the top 10 countries based on their GDP


Clearly, among the top 10, the United States is leading the race, followed by China
and Japan. So, without looking at the raw data, we see rich information in
this visualization. Now, the obvious next question that comes to mind is, what really
makes any country's GDP go up or down? Let’s try to understand, for these countries,
what percentage of their GDP is contributed by agriculture, the service sector, and
industry.


# Agriculture
Agri_GDP <- read.csv("Dataset/Agriculture - Top 10 Country.csv")

Again, we melt the data into a smaller number of columns to allow plotting.

Agri_GDP_Long_Format <- melt(Agri_GDP, id = "Country")
names(Agri_GDP_Long_Format) <- c("Country", "Year", "Agri_Perc")

# Drop the leading "X" that read.csv() adds to column names that start with a digit
Agri_GDP_Long_Format$Year <- substr(Agri_GDP_Long_Format$Year, 2,
                                    length(Agri_GDP_Long_Format$Year))
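The substr() call is needed because read.csv() by default rewrites column names that are not syntactically valid R names, so a header such as 2010 becomes X2010. The sketch below shows this on a tiny inline CSV with made-up values and two ways of getting the plain year back; the variable names are assumptions made for illustration.

```r
# Tiny inline CSV with made-up values; the header uses years as column names
csv_text <- "Country,2010,2011\nA,1.2,1.3\nB,2.1,2.2"

default_names <- names(read.csv(text = csv_text))
default_names                      # "Country" "X2010" "X2011"

# Option 1: strip the prefix afterwards
stripped_names <- sub("^X", "", default_names)
stripped_names                     # "Country" "2010" "2011"

# Option 2: keep the header exactly as written in the file
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))
kept_names                         # "Country" "2010" "2011"
```

Either option avoids relying on character positions, which makes the intent clearer than counting from the second character onward.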
Apply the ggplot2 options to create plots as follows:


ggplot(Agri_GDP_Long_Format, aes(x = Year, y = Agri_Perc, group = Country)) +
  geom_line(aes(colour = Country)) +
  geom_point(aes(colour = Country), size = 5) +
  theme(legend.title = element_text(family = "Times", size = 20),
        legend.text = element_text(family = "Times", face = "italic", size = 15),
        plot.title = element_text(family = "Times", face = "bold", size = 20),
        axis.title.x = element_text(family = "Times", face = "bold", size = 12),
        axis.title.y = element_text(family = "Times", face = "bold", size = 12)) +
  xlab("Year") +
  ylab("Agriculture % Contribution to GDP") +
  ggtitle("Agriculture % Contribution to GDP - Top 10 Countries")

Figure 4-2. A line chart showing the top 10 countries based on percent contribution to
GDP from agriculture


Countries like India and Brazil, which didn't make the top three spots when we
looked at GDP, top the charts in agriculture (along with China, which comes in the top
three here as well). This shows the importance these countries give to agriculture.

# Service
Service_GDP <- read.csv("Services - Top 10 Country.csv")

Service_GDP_Long_Format <- melt(Service_GDP, id = "Country")
names(Service_GDP_Long_Format) <- c("Country", "Year", "Service_Perc")
Service_GDP_Long_Format$Year <- substr(Service_GDP_Long_Format$Year, 2,
                                       length(Service_GDP_Long_Format$Year))

ggplot(Service_GDP_Long_Format, aes(x = Year, y = Service_Perc, group = Country)) +
  geom_line(aes(colour = Country)) +
  geom_point(aes(colour = Country), size = 5) +
  theme(legend.title = element_text(family = "Times", size = 20),
        legend.text = element_text(family = "Times", face = "italic", size = 15),
        plot.title = element_text(family = "Times", face = "bold", size = 20),
        axis.title.x = element_text(family = "Times", face = "bold", size = 12),
        axis.title.y = element_text(family = "Times", face = "bold", size = 12)) +
  xlab("Year") +
  ylab("Service sector % Contribution to GDP") +
  ggtitle("Service sector % Contribution to GDP - Top 10 Countries")

Figure 4-3. A line chart showing the top 10 countries based on percent contribution to
GDP from the service sector


Now, in contrast to agriculture, looking at the service sector helps you understand
the reason behind the large GDP of the United States, China, and the United Kingdom.
These countries have typically built their strong economies on service sectors. So, when
you hear about Silicon Valley in the United States and London being the world's largest
financial center, these are actually their economies’ biggest growth drivers.

# Industry
Industry_GDP <- read.csv("Industry - Top 10 Country.csv")

Industry_GDP_Long_Format <- melt(Industry_GDP, id = "Country")
names(Industry_GDP_Long_Format) <- c("Country", "Year", "Industry_Perc")
Industry_GDP_Long_Format$Year <- substr(Industry_GDP_Long_Format$Year, 2,
                                        length(Industry_GDP_Long_Format$Year))

ggplot(Industry_GDP_Long_Format, aes(x = Year, y = Industry_Perc, group = Country)) +
  geom_line(aes(colour = Country)) +
  geom_point(aes(colour = Country), size = 5) +
  theme(legend.title = element_text(family = "Times", size = 20),
        legend.text = element_text(family = "Times", face = "italic", size = 15),
        plot.title = element_text(family = "Times", face = "bold", size = 20),
        axis.title.x = element_text(family = "Times", face = "bold", size = 12),
        axis.title.y = element_text(family = "Times", face = "bold", size = 12)) +
  xlab("Year") +
  ylab("Industry % Contribution to GDP") +
  ggtitle("Industry % Contribution to GDP - Top 10 Countries")

Figure 4-4. A line chart showing the top 10 countries based on percent contribution to
GDP from industry
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After looking at agriculture and service sector, industry is the third biggest 
component in the GDP pie. And this particular component is by far led by China and 
their manufacturing industry. This is why you see many big brands like Apple embedding 
a label in their products that says, "Designed by Apple in California. Assembled in China’. 
It’s not just mobile phones or companies like Apple, China is a manufacturing hub for 
many product segments like apparel and accessories, automobile parts, motorcycle parts, 
furniture, and the list goes on. 

So, the overall trend shows that while industry and the service sector keep increasing 
their contributions to GDP, agriculture has seen a steady decrease. Is this a signal 
of growth or a compromise of our food sources in the name of more lucrative sectors? 
Perhaps we will leave that question for the economic experts to answer. However, we 
can clearly see how this visualization reveals insights that would otherwise have been 
difficult to extract from the raw data. 

As a concluding remark, among these big economies, many countries, such as China, 
France, Australia, and Japan, are witnessing a drastic drop in their industrial output. 
India is the only country among these 10 where there has been a steady increase in 
industrial output over the years, which is a sign of development. Having said that, it 
remains to be seen how agriculture and the service sector are balanced against the 
unprecedented growth in industry. Even with these unbalanced economies of developed and 
developing countries, what really helps keep the balance is that the world is a lot 
freer when it comes to trade. If you have strong agricultural output, you are free to 
export your production to other countries where it is deficient, and the same goes for 
the other sectors. 

Before we embark on another story through visualization, the following section 
shows a stacked column chart showing percentage contributions from each of the sectors 
to the world’s total GDP. 


4.4 Stacked Column Charts 


Stacked column charts are an elegant way of showing the composition of various categories 
that make up a particular variable. Here in the example in Figure 4-5, it’s easy to see how 
much percentage contribution each of these sectors has in the world's total GDP. 
library(plyr) 

World_Comp_GDP <- read.csv("World GDP and Sector.csv") 

World_Comp_GDP_Long_Format <- melt(World_Comp_GDP, id ="Sector") 
names(World_Comp_GDP_Long_Format) <- c("Sector", "Year", "USD") 

World_Comp_GDP_Long_Format$Year <- substr(World_Comp_GDP_Long_Format$Year, 
  2, length(World_Comp_GDP_Long_Format$Year)) 

# calculate midpoints of bars 

World_Comp_GDP_Long_Format_Label <- ddply(World_Comp_GDP_Long_Format, .(Year), 
  transform, pos = cumsum(USD) - (0.5 * USD)) 

ggplot(World_Comp_GDP_Long_Format_Label, aes(x = Year, y = USD, fill = Sector)) + 
  geom_bar(stat ="identity") + 
  geom_text(aes(label = USD, y = pos), size =3) + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Year") + 
  ylab("% of GDP") + 
  ggtitle("Contribution of various sector in the World GDP") 

Figure 4-5. A stacked column chart showing the contribution of various sectors to the 
world's GDP 
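
The label positions in the listing above come from a running total: within each year, the midpoint of a segment is the cumulative sum up to and including that segment, minus half of the segment's own height. A minimal base-R sketch of that arithmetic follows; the sector shares are made-up values chosen for illustration, not World Bank figures.

```r
# Hypothetical percentage shares for one year (illustrative values only)
usd <- c(Agriculture = 4, Industry = 28, Service = 68)

# Top edge of each stacked segment
top <- cumsum(usd)

# Midpoint of each segment: top edge minus half the segment height
pos <- top - 0.5 * usd

print(pos)
```

Whether positions computed this way line up with the drawn segments depends on the order in which ggplot2 stacks them, which has changed between versions. In recent ggplot2 releases, geom_text(position = position_stack(vjust = 0.5)) centres the labels without a precomputed pos column, so it is worth checking the output against your installed version.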


It's clear from the stacked column chart in Figure 4-5 that the service sector makes 
the largest contribution in every year, followed by industry and then agriculture. 
Because the relative size of each block barely changes, GDP has grown in similar 
proportions across these sectors. 

The age dependency ratio is a good example of a measure that line plots and stacked 
column charts can help investigate. As defined by the World Bank, the age dependency 
ratio is the ratio of dependents (people younger than 15 or older than 64) to the 
working-age population (those aged 15 to 64). 
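
As a worked illustration of this definition, the sketch below computes the ratio from population shares, using the 2015 working-age and dependent percentages for India and China that are quoted later in this section. Expressing the result as dependents per 100 working-age people is an assumption about presentation; the variable names are hypothetical.

```r
# 2015 population shares (in %) as quoted in this section
working_age <- c(India = 65.6, China = 73.22)   # aged 15 to 64
dependents  <- c(India = 34.41, China = 26.78)  # younger than 15 or older than 64

# Age dependency ratio: dependents per 100 working-age people
age_dependency_ratio <- dependents / working_age * 100

print(round(age_dependency_ratio, 1))
```

A larger working-age share pushes the ratio down, which is why the working age ratio and the age dependency ratio move in opposite directions.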

If the age dependency ratio is very high for a country, government expenditure on 
health, social security, and education goes up. This money is mostly spent on people 
younger than 15 or older than 64 (the numerator), while the number of people 
supporting these expenditures (those aged 15 to 64, the denominator) is smaller. 
This also means individuals in the workforce have to carry more of the burden of 
supporting their dependents than is recommended. At times, this leads to social issues 
like child labor (children younger than 15 ending up in the adult workforce). So, many 
developing economies where age dependency is high have to deal with these issues. The 
stacked line chart in Figure 4-6 shows how the working age ratio has been decreasing 
over the years for the top 10 countries. 

library(reshape2) 
library(ggplot2) 

Population_Working_Age <- read.csv("Age dependency ratio - Top 10 Country.csv") 

Population_Working_Age_Long_Format <- melt(Population_Working_Age, id ="Country") 
names(Population_Working_Age_Long_Format) <- c("Country", "Year", "Wrk_Age_Ratio") 
Population_Working_Age_Long_Format$Year <- substr(Population_Working_Age_Long_Format$Year, 
  2, length(Population_Working_Age_Long_Format$Year)) 

ggplot(Population_Working_Age_Long_Format, aes(x=Year, y=Wrk_Age_Ratio, 
       group=Country)) + 
  geom_line(aes(colour=Country)) + 
  geom_point(aes(colour=Country), size =5) + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Year") + 
  ylab("Working age Ratio") + 
  ggtitle("Working age Ratio - Top 10 Countries") 

Figure 4-6. A stacked line chart showing the top 10 countries based on their working 
age ratio 


If you look at the charts in Figures 4-6 and 4-7, you will notice that, in recent 
years, countries like Japan and France have the largest ageing populations and hence a 
higher age dependency ratio, whereas countries like India and China have a strong and 
large population of young people and thus show a steady decrease in this ratio over the 
years. For instance, in 2015, India and China reported 65.6% and 73.22% of their 
population aged between 15 and 64, respectively (leaving 34.41% and 26.78% aged below 
15 or above 64). The corresponding percentages for Japan and France are 60.8% and 
62.4%, respectively (leaving 39.19% and 37.57% aged below 15 or above 64). 


library(reshape2) 
library(ggplot2) 
library(plyr) 

Population_Age <- read.csv("Population Ages - All Age - Top 10 Country.csv") 
Population_Age_Long_Format <- melt(Population_Age, id ="Country") 
names(Population_Age_Long_Format) <- c("Country", "Age_Group", "Age_Perc") 
Population_Age_Long_Format$Age_Group <- substr(Population_Age_Long_Format$Age_Group, 
  2, length(Population_Age_Long_Format$Age_Group)) 

# calculate midpoints of bars 

Population_Age_Long_Format_Label <- ddply(Population_Age_Long_Format, .(Country), 
  transform, pos = cumsum(Age_Perc) - (0.5 * Age_Perc)) 

ggplot(Population_Age_Long_Format_Label, aes(x = Country, y = Age_Perc, 
       fill = Age_Group)) + 
  geom_bar(stat ="identity") + 
  geom_text(aes(label = Age_Perc, y = pos), size =3) + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  ylab("% of Total Population") + 
  ggtitle("Age Group - % of Total Population - Top 10 Country") 

Figure 4-7. A stacked bar chart showing the constituents of different age groups as a 
percentage of the total population 













In a way, many economic factors (such as income parity, inflation, imports and 
exports, GDP, and many more) have a direct or indirect effect on population growth and 
ageing. With population growth slowing down in most countries, as shown in Figure 4-8, 
there is a need for good public policies and awareness campaigns from governments in 
order to balance the ageing and younger populations over the coming years. 


library(reshape2) 
library(ggplot2) 

Population_Growth <- read.csv("Population growth (annual %) - Top 10 Country.csv") 

Population_Growth_Long_Format <- melt(Population_Growth, id ="Country") 
names(Population_Growth_Long_Format) <- c("Country", "Year", "Annual_Pop_Growth") 
Population_Growth_Long_Format$Year <- substr(Population_Growth_Long_Format$Year, 
  2, length(Population_Growth_Long_Format$Year)) 

ggplot(Population_Growth_Long_Format, aes(x=Year, y=Annual_Pop_Growth, 
       group=Country)) + 
  geom_line(aes(colour=Country)) + 
  geom_point(aes(colour=Country), size =5) + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Year") + 
  ylab("Annual % Population Growth") + 
  ggtitle("Annual % Population Growth - Top 10 Countries") 

Figure 4-8. A line chart showing the top 10 countries and their annual percentage of 
population growth 

These plots are very interesting to peruse. The population growth for a few countries 
is very erratic, while for others it is stable and decreasing. For instance, see the 
population growth of India, which has been steadily decreasing, while for the United 
States it stabilized and then increased. 


4.5 Scatterplots 


A scatterplot is a graph that helps identify whether there is a relationship between 
two variables. Scatterplots use Cartesian coordinates to show two variables on the x- 
and y-axes. Higher-dimensional scatterplots are also possible, but they are difficult 
to visualize, hence two-dimensional scatterplots are very popular. If we add dimensions 
such as color, shape, or size, we can present more than two variables on a 
two-dimensional scatterplot as well. In this case, we will look at a population growth 
indicator from the World Bank's development indicators. 

Any economy's strength is its people, and it is most important to measure if the 
citizens are doing well in terms of their financials, health, education, and all the basic 
necessities. A robust and strong economy is only built if it’s designed and planned to keep 
the citizens at the center of everything. So, while GDP as an indicator signifies the growth 
of the country, there are many indicators that measure how well people are growing with 
the GDP. So, before we look at such indicators, let’s try to explore the basic characteristics 
of the data using some of the widely used visualization tools, like scatterplots, boxplots, 
and histograms. Let’s see if there are some patterns emerging from the population growth 
data and the GDP of the top 10 countries. 


library(reshape2) 
library(ggplot2) 

GDP_Pop <- read.csv("GDP and Population 2015.csv") 

ggplot(GDP_Pop, aes(x=Population_Billion, y=GDP_Trilion_USD)) + 
  geom_point(aes(color=Country), size =5) + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Population ( in Billion)") + 
  ylab("GDP (in Trillion US $)") + 
  ggtitle("Population Vs GDP - Top 10 Countries") 

Figure 4-9. A scatterplot showing the relationship between population and GDP for the 
top 10 countries 


The scatterplot in Figure 4-9 shows that for countries like the United States (US), 
the population has, since 2009, been relatively low compared to other countries in the 
top 10; however, the United States, being the world's largest economy, has a very large 
GDP, which places its point high on the y-axis of the scatterplot. Similarly, China, 
with the world's largest population of 1.37 billion and a GDP of 10.8 trillion US 
dollars, is represented by a point on the extreme right of the x-axis. 


4.6 Boxplots 


Boxplots are a compact way of representing the five-number summary described in 
Chapter 1, namely the median, the first and third quartiles (25th and 75th 
percentiles), and the minimum and maximum. The upper side of the vertical rectangular 
box represents the third quartile and the lower side the first quartile. The difference 
between the two is known as the interquartile range (IQR), and the box contains the 
middle 50% of the data. A line dividing the rectangle represents the median. Lines 
extending from both sides of the rectangle (known as whiskers) indicate the variability 
outside the first and third quartiles. Finally, the points plotted beyond the ends of 
the whiskers are called outliers. Numerically, with the default settings of 
geom_boxplot(), these are points that lie more than 1.5 times the interquartile range 
beyond the first or third quartile. 

# GDP 

GDP_all <- read.csv("Dataset/WDi/GDP All Year.csv") 

GDP_all_Long_Format <- melt(GDP_all, id ="Country") 
names(GDP_all_Long_Format) <- c("Country", "Year", "GDP_USD_Trillion") 
GDP_all_Long_Format$Year <- substr(GDP_all_Long_Format$Year, 
  2, length(GDP_all_Long_Format$Year)) 

ggplot(GDP_all_Long_Format, aes(factor(Country), GDP_USD_Trillion)) + 
  geom_boxplot(aes(fill =factor(Country))) + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Country") + 
  ylab("GDP (in Trillion US $)") + 
  ggtitle("GDP (in Trillion US $): Boxplot - Top 10 Countries") 

Figure 4-10. A boxplot showing the GDP (in trillion US$) for the top 10 countries 


A boxplot is a wonderful representation of the degree of dispersion (spread), 
skewness, and outliers in a single plot. Using ggplot, it's possible to place the 
different categories of a variable side by side for comparison. For instance, 
Figure 4-10 shows a boxplot of GDP by country, covering the GDP data from 1962 to 2015. 
You see that the United States has shown the highest level of growth (degree of 
dispersion) with no outliers, indicating sustained growth with no extreme highs or 
lows, whereas China shows a high number of outliers, which roughly indicates that the 
country has seen many unpredicted spurts of growth between 1962 and 2015. This 
interpretation is prone to error, as we haven't looked at the reasons for these 
outliers in the data. An economist might intuitively generate some insights by just 
glancing at this plot; however, a naive analyst might draw erroneous conclusions by not 
paying attention to the details. So, always temper the excitement of seeing a 
beautiful visualization and carefully analyze the other statistical properties of the 
data before drawing conclusions. 
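
To make the numbers behind a boxplot concrete, the sketch below computes the five-number summary and the conventional outlier fences (1.5 times the interquartile range beyond the quartiles) in base R. The sample vector is made up for illustration, and fivenum() uses Tukey's hinges, which can differ slightly from the quartiles returned by other methods.

```r
# Hypothetical sample with one unusually large value (illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)

# Spread of the middle half of the data, measured between the hinges
iqr <- five[4] - five[2]

# Conventional fences: points beyond these are drawn as outliers
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# boxplot.stats() applies the same 1.5 * IQR rule by default
outliers <- boxplot.stats(x)$out

print(five)
print(c(lower_fence, upper_fence))
print(outliers)
```

Checking the flagged points against the fences in this way is one simple step toward understanding why a boxplot shows outliers before interpreting them.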


# Population 

Population_all <- read.csv("Population All Year.csv") 
Population_all_Long_Format <- melt(Population_all, id ="Country") 
names(Population_all_Long_Format) <- c("Country", "Year", "Pop_Billion") 
Population_all_Long_Format$Year <- substr(Population_all_Long_Format$Year, 
  2, length(Population_all_Long_Format$Year)) 

ggplot(Population_all_Long_Format, aes(factor(Country), Pop_Billion)) + 
  geom_boxplot(aes(fill =factor(Country))) + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Country") + 
  ylab("Population (in Billion)") + 
  ggtitle("Population (in Billion): Boxplot - Top 10 Countries") 

The boxplot for the population of these 10 countries (in Figure 4-11) shows a similar 
trend but with no outliers. India and China clearly emerge as the largest countries in 
terms of population. 

Figure 4-11. A boxplot showing the population (in billions) for the top 10 countries 


4.7 Histograms and Density Plots 


A histogram is one of the most basic and easy-to-understand graphical representations 
of numerical data. It consists of rectangular boxes. The width of each rectangle covers 
a certain range of values and the height signifies the number of data points within 
that range. Constructing a histogram begins with dividing the entire range of values 
into non-overlapping, equal-sized bins (the rectangles). Histograms show an estimate of 
the probability distribution of a continuous variable. 

Now imagine increasing the number of bins in the histogram to a large number. As a 
result, the rectangles appear to merge into a smooth outline enclosing an area with 
some density. Alternatively, you could use a density plot directly. Here we will show a 
histogram and then a density plot separately. 
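
The binning step described above can be inspected without drawing anything. The following base-R sketch, using a made-up sample and an assumed bin width of 2, counts the points per bin, shows the density scaling in which the bar areas sum to 1, and computes a kernel density estimate as the smooth alternative.

```r
# Hypothetical sample values (illustrative only)
x <- c(1.2, 1.9, 2.4, 3.1, 3.3, 4.8, 5.0, 5.5, 7.2, 9.6)

# Equal-width, non-overlapping bins covering the whole range
breaks <- seq(0, 10, by = 2)

# Compute the histogram without plotting it
h <- hist(x, breaks = breaks, plot = FALSE)

print(h$counts)   # number of points falling in each bin
print(h$density)  # counts rescaled so that the bar areas sum to 1

# Kernel density estimate: a smooth curve instead of bars
d <- density(x)
```

The density scaling is what aes(y = ..density..) requests in the ggplot listings of this section, which makes histograms of groups with different sizes comparable.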


# Population 

Population_all <- read.csv("Population All Year.csv") 
Population_all_Long_Format <- melt(Population_all, id ="Country") 
names(Population_all_Long_Format) <- c("Country", "Year", "Pop_Billion") 
Population_all_Long_Format$Year <- substr(Population_all_Long_Format$Year, 
  2, length(Population_all_Long_Format$Year)) 

#Developed Country 

Population_Developed <- Population_all_Long_Format[!(Population_all_Long_Format$Country 
  %in% c('India', 'China', 'Australia', 'Brazil', 'Canada', 'France', 
         'United States')), ] 

ggplot(Population_Developed, aes(Pop_Billion, fill = Country)) + 
  geom_histogram(alpha =0.5, aes(y = ..density..), col="black") + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Population (in Billion)") + 
  ylab("Frequency") + 
  ggtitle("Population (in Billion): Histogram") 

Figure 4-12 shows the distribution of population for three countries: Germany, 
Japan, and the United Kingdom. 

Figure 4-12. A histogram showing GDP and population for three developed countries 


This distribution can be shown on a density scale as well; here is the corresponding 
plot. 

ggplot(Population_Developed, aes(Pop_Billion, fill = Country)) + 
  geom_density(alpha =0.2, col="black") + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Population (in Billion)") + 
  ylab("Frequency") + 
  ggtitle("Population (in Billion): Density") 

Figure 4-13. A density plot showing GDP and population for three developed countries 

#Developing Country 

Population_Developing <- Population_all_Long_Format[Population_all_Long_Format$Country 
  %in% c('India', 'China'), ] 


#Histogram 

ggplot(Population_Developing, aes(Pop_Billion, fill = Country)) + 
  geom_histogram(alpha =0.5, aes(y = ..density..), col="black") + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Population (in Billion)") + 
  ylab("Frequency") + 
  ggtitle("Population (in Billion): Histogram") 

Figure 4-14. A histogram showing GDP and population for two developing countries 


#Density 

ggplot(Population_Developing, aes(Pop_Billion, fill = Country)) + 
  geom_density(alpha =0.2, col="black") + 
  theme(legend.title=element_text(family="Times",size=20), 
        legend.text=element_text(family="Times",face ="italic",size=15), 
        plot.title=element_text(family="Times", face="bold", size=20), 
        axis.title.x=element_text(family="Times", face="bold", size=12), 
        axis.title.y=element_text(family="Times", face="bold", size=12)) + 
  xlab("Population (in Billion)") + 
  ylab("Frequency") + 
  ggtitle("Population (in Billion): Density Plot") 

Figure 4-15. A density plot showing GDP and population for two developing countries 


Looking at these histograms and density plots, you can see how the population data 
for these developed and developing nations is distributed over the years. Now that we 
have explored the data in detail, let's get a little more specific about indicators 
that are based on population but split by different cohorts, like country and age. 


4.8 Pie Charts 


In India, the lowest consumption group spends close to 53% of its money on food and 
beverages, whereas the higher consumption group spends by far the lowest share among 
all groups, at 12%. On the other hand, the higher group's spending on housing stands at 
39%. This gives one clear indication: the lowest consumption group, with less 
disposable income, spends a lot on basic survival needs like food, whereas the higher 
income group is looking for nice places to buy homes. The middle income group looks 
very similar to the higher group, but it allocates a larger slice of the pie to food as 
well, at 21%. 

So, in India, businesses around real estate and the food industry have flourished to 
an all-time high in recent years. With a population base of 1.31 billion, the majority 
of it in the lowest, low, or middle income groups, India has become a land of 
opportunity for the food industry. 

Another interesting sector is transport, which draws its highest share of contribution 
from the higher income group, whose members often make travel plans throughout the 
year. Transport here includes the usual commuting between home and the office as well 
as holiday travel. The presence of global businesses like Uber, which address commuting 
problems worldwide and whose technology is present in more than 28 cities of India, 
tells us the potential of this sector. 

# India 

library(reshape2) 
library(ggplot2) 

GCD_India <- read.csv("India - USD - Percentage.csv") 

GCD_India_Long_Format <- melt(GCD_India, id ="Sector") 
names(GCD_India_Long_Format) <- c("Sector", "Income_Group", "Perc_Cont") 

ggplot(data=GCD_India_Long_Format, aes(x=factor(1), fill =factor(Sector))) + 
  geom_bar(aes(weight = Perc_Cont), width=1) + 
  coord_polar(theta="y", start =0) + 
  facet_grid(. ~ Income_Group) + 
  scale_fill_brewer(palette="Set3") + 
  xlab('') + 
  ylab('') + 
  labs(fill='Sector') + 
  ggtitle("India - Percentage share of each sector by Consumption Segment") 

Figure 4-16. A pie chart showing the percentage share of each sector by consumption 
segment in India 


In contrast to India, in China the need for food and housing is more evenly 
distributed among the different income groups. What emerges very distinctively in 
China is the spending on information and communication technologies (ICT) by the 
higher income group, which stands at 14% of total spending. This puts China more in the 
league of developed nations, where such high adoption of and spending on ICT can be 
seen. 


# China 

library(reshape2) 
library(ggplot2) 

GCD_China <- read.csv("China - USD - Percentage.csv") 

GCD_China_Long_Format <- melt(GCD_China, id ="Sector") 
names(GCD_China_Long_Format) <- c("Sector", "Income_Group", "Perc_Cont") 

ggplot(data=GCD_China_Long_Format, aes(x=factor(1), fill =factor(Sector))) + 
  geom_bar(aes(weight = Perc_Cont), width=1) + 
  coord_polar(theta="y", start =0) + 
  facet_grid(. ~ Income_Group) + 
  scale_fill_brewer(palette="Set3") + 
  xlab('') + 
  ylab('') + 
  labs(fill='Sector') + 
  ggtitle("China - Percentage share of each sector by Consumption Segment") 

Figure 4-17. A pie chart showing the percentage share of each sector by consumption 
segment in China 


The pie chart in Figure 4-17 is very intuitive. Look at the lowest segment, the pie 
chart on the extreme left. Almost half of the consumption is on food and beverages 
since, for the poor, the first priority is food. As you move to the higher segments, 
the priorities shift and things like education and ICT (computing devices) go up 
substantially. 


4.9 Correlation Plots 


The best way to show how much one indicator relates to another is by computing the 
correlation. Though we won't go into the details of the mathematics behind correlation, 
those of you who thought that correlations are only seen through an n x n matrix are in 
for a surprise. Here comes its visual representation, using the corrplot library in R. 
Correlation as a statistical measure is discussed in Chapter 6. 

corrplot is an R package that can be used for the graphical display of a correlation 
matrix and confidence intervals. It also contains some algorithms for matrix 
reordering. In addition, corrplot is good at details, including the choice of colors, 
text labels, color labels, layout, etc. 

In this last section of the chapter, we want to tie a few development indicators 
discussed in previous sections, like GDP and population, to some indicators that 
contribute to their growth. The World Bank data used here runs from 1961 to 2014 at an 
overall world level. For instance, fertility rate (births per woman) correlates highly 
with population growth rate. 
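
Before plotting, it can help to see what the underlying correlation matrix looks like. The sketch below builds one with base R's cor() and reproduces a single coefficient from its definition. The data frame and all of its values are hypothetical, chosen only so that the signs resemble the relationships discussed in this section; they are not World Bank figures.

```r
# Hypothetical indicator values (illustrative only, not World Bank data)
indicators <- data.frame(
  fertility_rate  = c(5.0, 4.5, 3.9, 3.2, 2.8, 2.5),
  pop_growth      = c(2.0, 1.9, 1.7, 1.5, 1.3, 1.2),
  life_expectancy = c(52, 56, 60, 64, 68, 71)
)

# Pearson correlation matrix: symmetric, with 1s on the diagonal
cor_matrix <- cor(indicators, method = "pearson")
print(round(cor_matrix, 2))

# The same coefficient for one pair, computed from the definition:
# covariance divided by the product of the standard deviations
x <- indicators$fertility_rate
y <- indicators$pop_growth
r_manual <- cov(x, y) / (sd(x) * sd(y))
print(r_manual)
```

A matrix like cor_matrix is what gets passed to corrplot() as its first argument.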


library(corrplot) 
library(reshape2) 
library(ggplot2) 

correlation_world <- read.csv("Correlation Data.csv") 

corrplot(cor(correlation_world[, 2:6], method ="pearson"), diag =FALSE, 
         title ="Correlation Plot", method ="ellipse", 
         tl.cex =0.7, tl.col ="black", cl.ratio =0.2 
) 

Figure 4-18. A plot showing correlation between various world development indicators 

The corrplot function offers many methods with which you can experiment to see 
different shapes in this plot (the method argument of the cor function, by contrast, 
defines which correlation measure to use; here we use the Pearson correlation). 
We prefer the "ellipse" method because the ellipse gives us shape and directional 
elements that capture more information. The combination of color, shape, and position 
encapsulates a numeric value in a visual representation. For example, the correlation 
between fertility rate and population growth has a value greater than 0 (reflected in 
the shades of blue), and the direction in which the ellipse slants shows whether the 
correlation is positive or negative. The shape represents the magnitude: a thin ellipse 
means a strong correlation (positive or negative), whereas a nearly circular one means 
a correlation close to zero. 

This way of leveraging color, shape, and position gives us more dimensions with which 
to present a visualization in 2D, which otherwise would have been difficult to 
visualize. Some of the insights we get from this plot without even looking at the 
correlation matrix are as follows: 


• As the fertility rate goes down, we can see an increase in life 
expectancy. 


• With an increase in life expectancy, people live longer 
and there is a greater burden on the economy to meet their 
healthcare needs. As a result, we see a negative correlation with 
the GDP growth rate. Although it would be a gross mistake to say 
that the increase in life expectancy alone causes the GDP growth 
rate to go down, it is fair to point out the negative correlation that 
exists between the two variables. 


• An increase in the female population also shows a positive (although 
not too high) correlation with the GDP growth rate. This might mean that 
women's contribution to household income growth, and hence to 
increased spending, has some effect on a country's GDP. 


4.10 Heatmaps 


Carrying on with the indicators and their correlations from the last section, we now 
turn to heatmaps. Heatmaps are visualizations of data in which values are represented 
as different shades of a color: the darker the shade, the higher the value. For example, 
a heatmap helps us visualize how different regions of the world are responding to the 
development indicators. 
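
Because development indicators are measured in very different units, a heatmap like the one in this section needs each indicator brought onto a common scale before the shades can be compared. The listing later in this section relies on rescale() from the scales package for that step. The sketch below shows, in base R, the min-max arithmetic that such a rescaling to the range 0 to 1 is understood to perform; rescale01 is a hypothetical helper name and the input values are made up.

```r
# Min-max rescaling of a numeric vector to the range [0, 1]
# (rescale01 is a hypothetical helper name used only in this sketch)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical indicator values for five regions (illustrative only)
inc_value <- c(12, 30, 48, 75, 102)

scaled <- rescale01(inc_value)
print(scaled)
```

Applying the rescaling separately to each indicator, as the ddply() call in the listing does, keeps one indicator with large raw values from washing out the colors of the others.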

The heatmap in Figure 4-19 shows six development indicators and how their scaled 
values (between 0 and 1) compare across regions. Some insights we can derive from 
this heatmap are: 


• The East Asia and Pacific region has the world's highest
population (mostly contributed by China), followed by South Asia
(contribution from India).

• North America, with its very low population, has the highest GDP
per capita value (GDP/Population). It also has the lowest fertility
rate and highest life expectancy, which is consistent with the high
correlation between these two indicators. Sub-Saharan Africa
has the lowest GDP and GDP per capita.

• Interestingly, life expectancy throughout the world now looks
healthy in terms of its scaled value. This is perhaps because of
improved healthcare services and reduced fertility rates.
So, it seems most of the countries in the world are able to use
contraceptives and enjoy the economic benefits of a smaller
family.
library(corrplot)
library(reshape2)   # melt()
library(plyr)       # ddply()
library(ggplot2)
library(scales)     # rescale()

#Heat Maps

bc <- read.csv("Region Wise Data.csv")

#Reshape to long form: one row per region, indicator, and year
bc_long_form <- melt(bc, id = c("Region", "Indicator"))
names(bc_long_form) <- c("Region", "Indicator", "Year", "Inc_Value")

#Drop the first character of each year label
bc_long_form$Year <- substring(as.character(bc_long_form$Year), 2)

#Rescale the values of each indicator to the range 0 to 1
bc_long_form_rs <- ddply(bc_long_form, .(Indicator), transform,
                         rescale = rescale(Inc_Value))

ggplot(bc_long_form_rs, aes(Indicator, Region)) +
  geom_tile(aes(fill = rescale), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_grey(base_size = 11) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  theme(
    axis.text.x = element_text(size = 15 * 0.8, angle = 330, hjust = 0,
                               colour = "black", face = "bold"),
    axis.text.y = element_text(size = 15 * 0.8, colour = "black", face = "bold")) +
  ggtitle("Heatmap - Region Vs World Development Indicators") +
  theme(text = element_text(size = 12),
        title = element_text(size = 14, face = "bold"))
Figure 4-19. A heatmap between regions and their various world development indicators
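
The per-indicator rescaling used for the heatmap can be illustrated on a small made-up
table. The sketch below is only an illustration under stated assumptions: the data frame
`toy` and its values are hypothetical, and a hand-written `min_max()` helper stands in
for the rescale() function of the scales package so that the example runs in base R.

```r
# Hypothetical toy data: two indicators measured for three regions.
toy <- data.frame(
  Region    = rep(c("A", "B", "C"), times = 2),
  Indicator = rep(c("GDP", "Population"), each = 3),
  Value     = c(10, 20, 30, 100, 400, 250)
)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Rescale within each indicator so that indicators measured in
# different units become comparable shades on one color scale.
toy$Scaled <- ave(toy$Value, toy$Indicator, FUN = min_max)

print(toy)
```

Scaling within each indicator, rather than across the whole table, keeps a large-valued
indicator such as population from washing out the color range of the others.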


4.11 Bubble Charts 


In order to appreciate bubble charts, you should first watch the TED talk by Hans
Rosling called "The best stats you've ever seen." He is a Swedish medical doctor,
academic, statistician, and public speaker. Hans co-founded the Gapminder Foundation, a
non-profit organization promoting the use of data to explore development issues. The
foundation developed software named Trendalyzer, which was later acquired by Google and
made available as Google Motion Charts (accessible from R through the googleVis
package). Google didn't commercialize this product, but rather made it publicly
available for free.

In this section, we use a dataset made available by Gapminder, which has data on
continent, country, population, life expectancy, and GDP per capita from 1952 to 2007.
Though the chart looks good in 2D as a static chart, it's a visual delight to see these
bubbles move in a motion chart.


library(corrplot)
library(reshape2)
library(ggplot2)
library(scales)

#Bubble chart

bc <- read.delim("BubbleChart_GapMInderData.txt")

bc_clean <- droplevels(subset(bc, continent != "Oceania"))
str(bc_clean)
'data.frame': 1680 obs. of 6 variables:
 $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year     : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num 8425333 9240934 10267083 11537966 13079460 ...
 $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ lifeExp  : num 28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num 779 821 853 836 740 ...

#Keep only the most recent year
bc_clean_subset <- subset(bc_clean, year == 2007)
bc_clean_subset$year <- as.factor(bc_clean_subset$year)

ggplot(bc_clean_subset, aes(x = gdpPercap, y = lifeExp)) +
  scale_x_log10() +
  geom_point(aes(size = sqrt(pop / pi)), pch = 21, show.legend = FALSE) +
  scale_size_continuous(range = c(1, 40)) +
  facet_wrap(~continent) +
  aes(fill = continent) +
  scale_fill_manual(values = c("#FAB25B", "#276419", "#529624", "#C6E79C")) +
  xlab("GDP Per Capita(in US $)") +
  ylab("Life Expectancy(in years)") +
  ggtitle("Bubble Chart - GDP Per Capita Vs Life Expectancy") +
  theme(text = element_text(size = 12),
        title = element_text(size = 14, face = "bold"))
Figure 4-20. A bubble chart showing GDP per capita vs life expectancy

The book Lattice: Multivariate Data Visualization with R by Deepayan Sarkar (Springer,
2008), available via SpringerLink, has a comprehensive analysis of bubble charts.
Readers who want a deeper understanding of this visualization may refer to this book.

The bubble chart in Figure 4-20 plots life expectancy against GDP per capita for the
year 2007. The size of each bubble indicates the population of a country in that
continent: the bigger the bubble, the larger the population. You can see that Asia
contains multiple large bubbles because of India and China, whereas Europe consists
mostly of less populated countries with a high GDP per capita and life expectancy. The
Americas have some highly populated countries as well as high values for both
indicators.
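
The listings in this section pass `sqrt(pop/pi)` as the point size. One way to read that
expression, offered here as an interpretation rather than the authors' stated rationale,
is as the radius of a circle whose area equals the population. The sketch below checks
this on made-up populations; `pop`, `radius`, and `area` are illustrative names, and how
ggplot2 then maps the value to a drawn size still depends on the size scale in use.

```r
# Hypothetical populations for three countries.
pop <- c(small = 1e6, medium = 4e6, large = 16e6)

# Radius of a circle whose area equals the population: area = pi * r^2.
radius <- sqrt(pop / pi)

# Areas recovered from the radii match the populations again.
area <- pi * radius^2

# A 4x larger population gives a 2x larger radius but a 4x larger area.
print(radius / radius["small"])
print(area / area["small"])
```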


library(corrplot)
library(reshape2)
library(ggplot2)
library(scales)

bc <- read.csv("Bubble Chart.csv")

ggplot(bc, aes(x = GDPPerCapita, y = LifeExpectancy)) +
  scale_x_log10() +
  geom_point(aes(size = sqrt(Population / pi)), pch = 21, show.legend = FALSE) +
  scale_size_continuous(range = c(1, 40)) +
  facet_wrap(~Country) +
  aes(fill = Country) +
  xlab("GDP Per Capita(in US $)") +
  ylab("Life Expectancy(in years)") +
  ggtitle("Bubble Chart - GDP Per Capita Vs Life Expectancy - Four Countries") +
  theme(text = element_text(size = 12),
        title = element_text(size = 14, face = "bold"))
Figure 4-21. A bubble chart showing GDP per capita vs life expectancy for four countries


The bubble chart in Figure 4-21 is for the two most developed and two fastest 
developing countries. Note that the developing nations, China and India, are quickly 
catching up in GDP and life expectancy to the developed nations over the years, despite 
their large population. 


library(corrplot)
library(reshape2)
library(ggplot2)
library(scales)

bc <- read.csv("Bubble Chart.csv")

ggplot(bc, aes(y = FertilityRate, x = LifeExpectancy)) +
  scale_x_log10() +
  geom_point(aes(size = sqrt(Population / pi)), pch = 21, show.legend = FALSE) +
  scale_size_continuous(range = c(1, 40)) +
  facet_wrap(~Country) +
  aes(fill = Country) +
  ylab("Fertility rate, total (births per woman)") +
  xlab("Life Expectancy(in years)") +
  ggtitle("Bubble Chart - Fertility rate Vs Life Expectancy") +
  theme(text = element_text(size = 12),
        title = element_text(size = 14, face = "bold"))
Figure 4-22. A bubble chart showing fertility rate vs life expectancy


It's evident from the chart in Figure 4-22 that, with decreasing fertility rates, life
expectancy is getting longer for all four nations. India steadily reduced the gap
between itself and China in terms of life expectancy. The gap was 14 years to begin
with and fell to 6 or 7 years.


4.12 Waterfall Charts 


A waterfall chart helps visualize the cumulative effect of sequential changes (additions
and deletions) in a value. Just like flowing water, it shows the flow of values in and
out of the main value. Waterfall charts are also known as flying bricks charts or Mario
charts due to the apparent suspension of columns (bricks) in midair. They are very
popular in accounting and stock management visualizations, where quantities keep
changing in a sequential manner. We will use the waterfall package to create an example
on hypothetical border control data.

waterfall is an R package that provides support for creating waterfall charts in
R using both traditional base and lattice graphics. The package details can be found at
https://cran.r-project.org/web/packages/waterfall/waterfall.pdf.

The data we have is from border control, where each month the footfall of people is
counted. More people going out than coming in means the net migration is negative, and
more people coming in than going out means the net migration is positive. If we record
this exchange over the border for 12 months, we can see the net migration. The waterfall
chart will show us how it changed over these months.
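
The cumulative idea behind the chart can be sketched in a few lines of base R before
turning to the real data. The monthly figures below are invented for illustration, and
`net`, `running_total`, and `bars` are hypothetical names that are not part of the
waterfall package.

```r
# Hypothetical net monthly migration (people in minus people out).
net <- c(Jan = 500, Feb = 120, Mar = -80, Apr = 60, May = -150, Jun = 90)

# The running total is the level the chart has reached after each month.
running_total <- cumsum(net)

# Each floating bar spans from the previous level to the new level.
bars <- data.frame(
  month     = names(net),
  net       = unname(net),
  bar_start = c(0, unname(head(running_total, -1))),
  bar_end   = unname(running_total)
)

print(bars)
```

Negative months produce bars that hang below the previous level, which is what gives
the chart its floating-bricks look.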
#Read the Footfall Data
footfall <- read.csv("Dataset/Waterfall Shop Footfall Data.csv", header = TRUE)

#Display the data for easy read
footfall

#Convert the Months into factors to retain the order in chart
footfall$Month <- factor(footfall$Month)
footfall$Time_Period <- factor(footfall$Time_Period)

#Load waterfall library
library(waterfall)
library(lattice)

#Plot using waterfall
waterfallplot(footfall$Net, names.arg = footfall$Month,
              xlab = "Time Period(Month)", ylab = "Footfall",
              col = footfall$Type, main = "Footfall by Month")
Figure 4-23. Waterfall plot of footfall at the border


The green blocks are starting or ending blocks, corresponding to January and 
December, respectively. The red blocks are people coming in while black blocks are 
people going out. When you follow this over a year, you can see that positive migration 
happened most of the year, except in three months where more people went out (the 
black blocks). 

The following plot is an alternative view of the same waterfall chart (see Figure 4-24).


waterfallchart(Net ~ Time_Period, data = footfall, col = footfall$Type,
               xlab = "Time Period(Month)", ylab = "Footfall", main = "Footfall by Month")
Figure 4-24. Waterfall chart with net effect


The plot in Figure 4-24 is similar to the previous one; the only difference is the
total column at the end. The total column presents the final net value in our footfall
counter after the year ended.


The same plot can be created to show the percentage of footfall contribution by month.
This shows the ending footfall count of each month as a proportion of the total end
footfall. These percentages sum to 100 across the 12 months.


waterfallchart(Month ~ Footfall_End_Percent, data = footfall)
Figure 4-25. Footfall end count as percentage of total end count

Note that the end count fluctuated most during the month of April, followed by March and
November. The interpretation will vary based on what you are more interested in from the
plots in Figures 4-24 and 4-25.
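
As a rough illustration of the percentage view, the sketch below converts month-end
counts into shares of their total. The counts are invented, and `end_count` and
`end_percent` are hypothetical names; the percentage column in the book's dataset may
have been derived differently.

```r
# Hypothetical month-end footfall counts for four months.
end_count <- c(Jan = 200, Feb = 300, Mar = 100, Apr = 400)

# Share of each month in the total, expressed as a percentage.
end_percent <- 100 * end_count / sum(end_count)

print(end_percent)       # 20, 30, 10, 40
print(sum(end_percent))  # 100
```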


4.13 Dendrogram


Dendrograms are visual representations that are especially useful in clustering
analysis. They are tree diagrams frequently used to illustrate the formation of
clusters, as is done in hierarchical clustering. Chapter 6 explains how hierarchical
clustering works. Dendrograms are popular in computational biology, where similarities
among species can be presented using dendrograms to classify them.

Dendrograms can be drawn with the basic plot() command. Some other packages, like
ggdendro and dendextend, offer more detailed dendrograms.

The y-axis in a dendrogram measures the distance (or dissimilarity) at which individual
data points or clusters are merged.

The x-axis lists the elements in the dataset (and hence the leaf nodes can look messy).

The dendrogram helps in choosing the right number of clusters by showing how the tree
grows with distance (or height) on the y-axis. Cut the tree where you feel substantially
separated clusters can be seen on the dendrogram. A cut is a horizontal line y = c, and
the number of branches that the line crosses is the number of clusters.

Here, we create an example with the iris data, and in the end show how well the clusters
fit the actual data.


library(ggplot2)

data(iris)

# prepare hierarchical cluster on iris data
hc <- hclust(dist(iris[, 1:2]))

# using dendrogram objects
hcd <- as.dendrogram(hc)

#Zoom into it at level 1
plot(cut(hcd, h = 1)$upper, main = "Upper tree of cut at h=1")
Figure 4-26. Dendrogram with distance/height up to h=1


Looking at the dendrogram in Figure 4-26, the best cut seems to be somewhere between 2
and 3, as the clusters have to be complete. We will go ahead with three clusters and see
how they fit our prior knowledge of the species.


#Let's see what the clusters look like if we cut the tree into three clusters
clusterCut <- cutree(hc, 3)

iris$clusterCut <- as.factor(clusterCut)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point(alpha = 0.4, size = 3.5) +
  geom_point(col = clusterCut) +
  scale_color_manual(values = c('black', 'red', 'green'))




Figure 4-27. Clusters by actual classification of species in iris data 


We can see in the plot in Figure 4-27 that most of the clusters we predicted match the
existing classification of species. This also means the variables we use for
clustering, petal width and petal length, are important features for determining the
species.
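
The visual comparison can be complemented with a cross-tabulation of cluster labels
against species. The sketch below rests on assumptions of its own: it clusters on the
two petal measurements with hclust() defaults, which may differ from the columns used in
the listing above, and `hc_petal`, `clusters`, and `confusion` are hypothetical names.

```r
data(iris)

# Hierarchical clustering on petal length and width (an assumption made
# for this sketch), using hclust() defaults (complete linkage).
hc_petal <- hclust(dist(iris[, c("Petal.Length", "Petal.Width")]))

# Cut the tree into three clusters; the result is one label per flower.
clusters <- cutree(hc_petal, k = 3)

# Rows are cluster labels, columns are the known species.
confusion <- table(cluster = clusters, species = iris$Species)
print(confusion)
```

A cluster whose row is concentrated in a single column corresponds closely to one
species; rows spread over several columns show where the clustering and the known
labels disagree.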


4.14 Wordclouds 


Wordclouds are word plots in which the size of each word is weighted by its frequency.
The more frequently a word appears, the bigger the word. You can look at text data and
quickly identify the most prominent themes discussed. The earliest example of weighted
lists of English keywords was the "subconscious files" in Douglas Coupland's Microserfs
(1995). Since then, wordclouds have become a prominent way of quickly perceiving the
most frequent terms and of locating a word alphabetically to determine its relative
importance.

In R, we have the wordcloud package, which is used in this section to create a
wordcloud. The details of this package are available at
https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf.

In this section, we show a good example of how wordclouds can be useful. We copied
multiple job descriptions for a data science position from the Internet. The wordcloud
of this document will tell us which words occur most frequently in the job descriptions
and hence give us an idea of which skills are hot in the market and which other
qualities are in demand.


#Load the text file
job_desc <- readLines("Dataset/wordcloud.txt")

library(tm)
library(SnowballC)
library(wordcloud)
Loading required package: RColorBrewer

jeopCorpus <- Corpus(VectorSource(job_desc))
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
#jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))

#Remove punctuation marks
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)

#Remove English stopwords and some more custom words
jeopCorpus <- tm_map(jeopCorpus, removeWords,
                     c("Data", "data", "Experi", "work", "develop", "use",
                       "will", "can", "you", "busi", stopwords('english')))

#Stem the words
jeopCorpus <- tm_map(jeopCorpus, stemDocument)

#Create the color palette for the word images
pal <- brewer.pal(9, "YlGnBu")
pal <- pal[-(1:4)]
set.seed(146)

#Create the wordcloud
wordcloud(words = jeopCorpus, scale = c(3, 0.5), max.words = 100,
          random.order = FALSE, rot.per = 0.10, use.r.layout = FALSE, colors = pal)
Figure 4-28. Wordcloud of job descriptions

The wordcloud shows that the key trends in data science positions are experienced
people, analyst positions, Hadoop, statistics, Python, and others. This way, without
even going through all the data, we have been able to extract the prominent requirements
for a data science position.
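
The frequency weighting behind a wordcloud can be sketched with base R alone. The text
below is made up, and the simple tokenization (lowercasing and splitting on non-letters)
is a stand-in for the tm pipeline above, not a reproduction of it; `toy_text`, `words`,
and `freq` are hypothetical names.

```r
# Hypothetical snippets standing in for job descriptions.
toy_text <- c("Python and statistics experience required",
              "Experience with Hadoop and Python",
              "Strong statistics background, Python preferred")

# Lowercase, then split on anything that is not a letter.
words <- unlist(strsplit(tolower(toy_text), "[^a-z]+"))
words <- words[nzchar(words)]

# Drop a few common filler words (a tiny stand-in for a stopword list).
words <- words[!words %in% c("and", "with")]

# Count occurrences and sort from most to least frequent; in a wordcloud
# these counts would drive the size of each word.
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```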


4.15 Sankey Plots 


Sankey plots are also called river plots. They are used to show how the different
elements of data are connected, with the thickness of the connecting lines representing
the strength of the connection. They help show the flow of connected items from one
factor to another.

It is highly recommended that users explore a powerful visualization package for
making lots of beautiful charts in R: rCharts. The source of this package, with lots of
examples, can be found at https://github.com/ramnathv/rCharts.

In the following example, we will use another powerful visualization tool,
googleVis. googleVis is an R interface to the Google Charts API, allowing users
to create interactive charts based on data frames. Charts are displayed locally via the
R HTTP help server. A modern browser with an Internet connection is required and
for some charts a Flash player. The data remains local and is not uploaded to Google.
(Source: https://cran.r-project.org/web/packages/googleVis/googleVis.pdf)

In our example, we will show how the HousePrice flows among different attributes;
we have chosen three layers for the plot: Type of House, Estate Type, and Type of Sale.


#Load the data from sankey2.csv
sankey_data <- read.csv("Dataset/sankey2.csv", header = TRUE)

library(googleVis)

plot(
  gvisSankey(sankey_data, from = "Start",
             to = "End", weight = "Weight",
             options = list(
               height = 250,
               sankey = "{link:{color:{fill:'lightblue'}}}"
             ))
)


Note The visualization is loaded in a web browser, so you need a working
Internet connection to load this example.


starting httpd help server ... 
done 
Figure 4-29. The Sankey chart for house sale data


The Sankey chart provides us with some important information: for example, the most
popular house type is the individual house. Individual houses are available in all the
estate types. Further, the societies only have individual houses, and these have gone
through new house sale, second resale, and third resale only. You can use these plots to
draw out many other insights as well.
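
The input that gvisSankey() takes in the listing above is an edge list: one row per
link, with a source, a target, and a weight. The sketch below builds such a table by
hand. The node names and weights are invented and are not taken from sankey2.csv; only
the column names Start, End, and Weight mirror the listing.

```r
# Hypothetical three-layer flow: house type -> estate type -> sale type.
edges <- data.frame(
  Start  = c("Individual House", "Individual House", "Apartment",
             "Society", "Society", "Township"),
  End    = c("Society", "Township", "Township",
             "New Sale", "Resale", "New Sale"),
  Weight = c(30, 20, 10, 18, 12, 30),
  stringsAsFactors = FALSE
)

# Total weight leaving and entering each node.
outflow <- tapply(edges$Weight, edges$Start, sum)
inflow  <- tapply(edges$Weight, edges$End, sum)

# For a middle-layer node, what flows in should flow out again.
print(inflow[["Society"]])   # 30
print(outflow[["Society"]])  # 30
```

Checking that inflow equals outflow for every middle-layer node is a quick sanity test
on an edge list before plotting it.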


4.16 Time Series Graphs 


We have already shown time series plots in earlier sections in this chapter. Essentially, 
when the data is time indexed, like GDP data, we take time on the x-axis and plot the 
data to see how it has been changing over time. We can use time series plots to evaluate 
patterns and behavior in data over time. 

R has powerful libraries to plot multiple types of time series plots. A good read
can be found at
https://cran.r-project.org/web/packages/timeSeries/vignettes/timeSeriesPlot.pdf.

For our example, we will show two time plots to understand some stark behavior:


• GDP of eight countries overlaid on a single plot to show how
GDP growth varied for these countries over the last 25 years.


• The GDP growth of three countries traced through the recession
of 2008.


The first example plots GDP growth over 25 years for eight countries/areas (the Arab 
world, UAE, Australia, Bangladesh, Spain, United Kingdom, India, and the United States). 
library(reshape2)
library(ggplot2)
library(ggrepel)

time_series <- read.csv("Dataset/timeseries.csv", header = TRUE)

#Reshape to long form: one row per date and country
mdf <- melt(time_series, id.vars = "Year")
mdf$Date <- as.Date(mdf$Year, format = "%d/%m/%Y")

names(mdf) <- c("Year", "Country", "GDP_Growth", "Date")

ggplot(data = mdf, aes(x = Date, y = GDP_Growth)) +
  geom_line(aes(color = Country), size = 1.5)
Figure 4-30. GDP growth for eight countries


The plot in Figure 4-30 shows that the most volatile economy among the eight
is the UAE. It showed phenomenal growth after the 1990s. You can also see
that during 2007-2009 all the economies showed lower GDP growth, due to a worldwide
recession.

In the following plot, we will see how the recession impacted three major
economies (the United States, the UK, and India) and to what extent.


#Now let's look at the growth rates for India, US and UK during the recession
#years (2006, 2007, 2008, 2009, 2010)
mdf_2 <- mdf[mdf$Country %in% c("India", "United.States", "United.Kingdom") &
             (mdf$Date > as.Date("2005-01-01") & mdf$Date < as.Date("2011-01-01")), ]

mdf_2$GDP_Growth <- round(mdf_2$GDP_Growth, 2)
tp <- ggplot(data = mdf_2, aes(x = Date, y = GDP_Growth)) +
  geom_line(aes(color = Country), size = 1.5)
tp + geom_text_repel(aes(label = GDP_Growth))
Figure 4-31. GDP growth during recession


You can see that in 2008 the United States and the UK showed negative growth,
while India's growth slowed but was not negative. The United States and the UK were
in a deep recession in 2009 as well, while India started picking up. After 2009, all the
economies were on the recovery path.
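
The date handling in the listings above, parsing day/month/year strings and then keeping
a window of years, can be tried on a toy data frame. The values below are invented, and
`toy`, `in_window`, and `recession` are hypothetical names; only the as.Date() format
string and the window boundaries follow the listings.

```r
# Hypothetical long-form data: one row per date for a single country.
toy <- data.frame(
  Year       = c("01/01/2004", "01/01/2008", "01/01/2009", "01/01/2012"),
  Country    = "India",
  GDP_Growth = c(7.9, 3.9, 8.5, 5.5)
)

# Parse day/month/four-digit-year strings into Date objects.
toy$Date <- as.Date(toy$Year, format = "%d/%m/%Y")

# Keep only rows strictly between the start of 2005 and the start of 2011.
in_window <- toy$Date > as.Date("2005-01-01") & toy$Date < as.Date("2011-01-01")
recession <- toy[in_window, ]

print(recession)
```

Comparing Date objects rather than strings matters here: as text, "01/01/2012" would
sort before "02/01/2004".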


4.17 Cohort Diagrams 


Cohort diagrams are two-dimensional diagrams used to present events that occur to a
set of observations (individuals) belonging to different cohorts. They are very popular
in credit analysis, marketing analysis, and other demographic studies. Cohort diagrams
are also sometimes called Lexis diagrams.

A cohort is a group of people assumed to behave differently from others based on shared
demographics. In our credit example, we define the cohorts by the year in which credit
was issued. This means each year's applicants will be treated as a cohort, and then we
track how many of them still remain unpaid in the following years. In cohort plots, time
is usually represented on the horizontal axis, while the value of interest is
represented on the vertical axis.

Let’s create the cohort diagram for our credit example. 


library(ggplot2)
library(reshape2)   # melt()
require(plyr)

cohort <- read.csv("Dataset/cohort.csv", header = TRUE)

#we need to melt data
cohort.chart <- melt(cohort, id.vars = "Credit_Issued")
colnames(cohort.chart) <- c('Credit_Issued', 'Year_Active', 'Active_Num')

cohort.chart$Credit_Issued <- factor(cohort.chart$Credit_Issued)

#define palette
blues <- colorRampPalette(c('lightblue', 'darkblue'))

#plot data
p <- ggplot(cohort.chart, aes(x = Year_Active, y = Active_Num, group = Credit_Issued))
p + geom_area(aes(fill = Credit_Issued)) +
  scale_fill_manual(values = blues(nrow(cohort))) +
  ggtitle('Active Credit Cards Volume')
Figure 4-32. The cohort plot for active credit cards by year of issue

The plot in Figure 4-32 shows how each cohort's volume changes with the number of
active years. You can see how the number of active cards decreases over the years. The
decline rate can be estimated by the slope of each cohort and can be tested against
others to see if some particular cohort behaved differently.
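
One hedged way to put a number on the decline rate is to fit a straight line to each
cohort and compare the slopes. The counts below are fabricated so that the answer is
easy to verify, and `toy_cohort` and `decline` are hypothetical names. Real cohort
curves are rarely this linear, so treat this as a sketch of the idea rather than a
recommended model.

```r
# Hypothetical active-card counts for two cohorts over five years.
toy_cohort <- data.frame(
  Credit_Issued = rep(c("2010", "2011"), each = 5),
  Year_Active   = rep(1:5, times = 2),
  Active_Num    = c(1000, 900, 800, 700, 600,    # loses 100 cards a year
                    1000, 950, 900, 850, 800)    # loses 50 cards a year
)

# Fit Active_Num ~ Year_Active within each cohort and keep the slope.
decline <- sapply(split(toy_cohort, toy_cohort$Credit_Issued), function(d) {
  unname(coef(lm(Active_Num ~ Year_Active, data = d))["Year_Active"])
})

print(decline)
```

A more negative slope means a faster loss of active cards, so in this toy data the 2010
cohort declines twice as fast as the 2011 cohort.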


4.18 Spatial Maps 


Spatial maps have become very popular recently. They are powerful presentations
of data that's tagged with locations on a map. If the information is geotagged, we can
create powerful visual presentations of the data. You can see lots of applications of
them: weather reporting, demographics, crime monitoring, trails monitoring, and
some very interesting crowd behavior tracking using Twitter data, Flickr data, and other
geotagged personal data.

We recommend a good read, available at
https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.

To show an example, we have selected the crime records data from the National
Crime Records Bureau, India. We show how the robbery cases across the states can be
shown on an Indian map. This will help us compare the states visually without getting
into the data itself.

Data source:
https://data.gov.in/catalog/cases-reported-and-value-property-stolen-place-occurrence

The ggmap package is used for spatial visualization along with ggplot2. It is a
collection of functions to visualize spatial data and models on top of static maps from
various online sources (e.g., Google Maps and Stamen Maps). It includes tools common
to those tasks, including functions for geolocation and routing.
(Source: https://cran.r-project.org/web/packages/ggmap/ggmap.pdf)

Let’s walk through each step in detail: 


1. Load the crime data into crime_data:


crime_data <- read.csv("Dataset/Case reported and value of property taken_away.csv",
                       header = TRUE)

#install.packages("ggmap")
library(ggmap)


2. Pull an example map to check if ggmap is working or not:


#Example map to test if ggmap is able to pull graphs or not
qmap(location = "New Delhi, India")

Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=New+Delhi,+India&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=New%20Delhi,%20India&sensor=false
Figure 4-33. An example map pulled using ggmap: New Delhi, India


The plot in Figure 4-33 confirms that ggmap can pull maps when a location is passed
into the function.


3. Get the geolocation of all the states of India present in the crime data:


crime_data$geo_location <- as.character(crime_data$geo_location)
crime_data$robbery <- as.numeric(crime_data$robbery)

#Let's just see the stats for 2010
mydata <- crime_data[crime_data$year == '2010', ]

#Summarise the data by state
library(dplyr)
mydata <- summarise(group_by(mydata, geo_location), robbery_count = sum(robbery))

#Get the geocode for each location
for (i in 1:nrow(mydata)) {
  latlon <- geocode(mydata$geo_location[i])
  mydata$lon[i] <- as.numeric(latlon[1])
  mydata$lat[i] <- as.numeric(latlon[2])
}
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=A&N%20Islands,%20India&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=West%20Bengal,%20India&sensor=false


4. Here you can see that each state has been geotagged with its
central coordinates as longitude and latitude pairs.


head(mydata)
# A tibble: 6 x 4
  geo_location             robbery_count      lon      lat
  <chr>                            <dbl>    <dbl>    <dbl>
1 A&N Islands, India                  14 10.89779 48.37054
2 Andhra Pradesh, India             1120 79.73999 15.91290
3 Arunachal Pradesh, India           138 94.72775 28.21800
4 Assam, India                      1330 92.93757 26.20060
5 Bihar, India                      3106 85.31312 25.09607
6 Chandigarh, India                  134 76.77942 30.73331


#Write the data with geocodes for future reference
mydata <- mydata[-8, ]
row.names(mydata) <- NULL

write.csv(mydata, "Dataset/Crime Data for 2010 from NCRB with geocodes.csv",
          row.names = FALSE)


5. Create a data frame with the aggregated number of
robberies in each state, its longitude, and its latitude.


Robbery_By_State <- data.frame(mydata$robbery_count, mydata$lon, mydata$lat)
colnames(Robbery_By_State) <- c('robbery', 'lon', 'lat')


6. Find the center of India on the map and then pull the map of
India to store in IndiaMap.


india_center <- as.numeric(geocode("India"))

Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=India&sensor=false

IndiaMap <- ggmap(get_googlemap(center = india_center, scale = 2, zoom = 5,
                                maptype = 'terrain'))

Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=20.593684,78.96288&zoom=5&size=640x640&scale=2&maptype=terrain&sensor=false




7. Plot the India map overlaid with orange circles showing the
robbery count for each state. The bigger the circle, the higher
the robbery count:


circle_scale_amt <- 0.005
IndiaMap +
  geom_point(data = Robbery_By_State, aes(x = lon, y = lat), col = "orange",
             alpha = 0.4, size = Robbery_By_State$robbery * circle_scale_amt) +
  scale_size_continuous(range = range(Robbery_By_State$robbery))


Figure 4-34. India map with robbery counts in 2010


Looking at the spatial visualization, it's easy to see the distribution of robbery
cases in India. We can also quickly do a comparative analysis by state. In the plot in
Figure 4-34, Maharashtra, then UP, and then Bihar top the list of robberies registered
in 2010.
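
Steps 3, 5, and 7 boil down to summing counts per state and scaling those sums into
circle sizes. The sketch below reproduces that logic in base R on invented records,
using aggregate() as a stand-in for the dplyr calls. The state names, counts, and the
`toy_crime` and `by_state` names are hypothetical; only the 0.005 scale factor is taken
from the walkthrough.

```r
# Hypothetical crime records: several rows per state for one year.
toy_crime <- data.frame(
  geo_location = c("State A", "State A", "State B", "State C", "State C"),
  robbery      = c(400, 600, 200, 1500, 500)
)

# Sum the robbery counts within each state.
by_state <- aggregate(robbery ~ geo_location, data = toy_crime, FUN = sum)

# Scale counts into point sizes, as the map code does with circle_scale_amt.
circle_scale_amt <- 0.005
by_state$point_size <- by_state$robbery * circle_scale_amt

print(by_state)
```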
4.19 Summary


Data visualization is an art and a science at the same time. What information to show
comes from scientific reasoning, while how to show it comes from the cognitive
capabilities of the brain. The brain is generally believed to process images faster than
numbers, so it becomes very important for a professional to compress information into
meaningful visuals rather than long data feeds.

In this chapter, we discussed many types of visualization plots and charts that can be
used to build a story around what the data is telling us. We started with World Bank data
and showed how to track the changes in key indicators using line charts and column
charts. We also saw how histograms and density plots save us from generalizing our
inferences by looking only at overall levels, as histograms show the distribution
within. Pie charts are a good way to show the contribution of individual components.
Boxplots were used to show the extreme values in our dataset. The correlation plot,
heatmaps, and bubble charts have many commonalities in terms of the rich information
they show in a relatively small area of a chart. While similarities exist, you need to
carefully choose the right graphs and plots to represent your data.

Waterfall charts were used to show how a sequential flow of information can be
captured in more intuitive ways. Sankey plots are similar to waterfall charts but are
drawn for a different purpose: they show the properties of connections among different
components in a flow visualization. Dendrograms have specific uses in clustering and
analysis of similarities among subjects. Time series plots are very important for
time-indexed data; using time series plots, we saw how GDP growth dropped, and in some
cases went negative, during the recession years for three countries.

Another popular chart is the cohort chart. These charts are very popular in analyzing
groups of people over time for changes in some key characteristic. We used a credit card
example where different cohorts were shown over different time periods from issuance of
credit. The last, and one of the most powerful, types of chart is the spatial map.
Spatial maps present information on maps. Any data that's geotagged can be presented
using spatial maps. Overall, R has scalable libraries to create powerful visualizations.

It’s of foremost importance that you understand the audience of your presentation 
before choosing an appropriate visualization technique. As stated, data visualization 
has a vast scope and we will continuously use many such plots, charts, and graphs 
throughout the book. 

In the next chapter, we will explore another aspect of data exploration, which is 
feature engineering. If we have hundreds and thousands of variables or features, how do 
we decide which particular feature is useful in building a ML model? Such questions will 
be answered to set the stage to start building our ML model in Chapter 6. 
CHAPTER 5


Feature Engineering 





In machine learning, feature engineering is a blanket term covering both the statistical and 
business judgment aspects of modeling real-world problems. The term was coined relatively 
recently to give due importance to the domain knowledge required to select sets of features 
for machine learning algorithms. This reliance on judgment is one of the reasons that most 
machine learning professionals call it an informal process. In this chapter, we provide an 
easy-to-use guide to the key terms and methodology used in feature engineering. The chapter 
gives due weight to domain knowledge and to some common business limitations encountered 
while using machine learning algorithms to solve business problems. 

The discussions will throw light on both aspects of feature engineering: 


• Domain knowledge and business limitations 
• Statistical principles 


Before we set the layout for learning objectives of this chapter, let’s spend some 
time understanding how feature engineering is different from what we learned so far in 
previous chapters. We will explain it with two questions: 


• What are my features and their properties? 
• How do my features interact with each other to fit a model? 


In order to quantify meaningful relationships between the response variable and 
predictor variables, we need to know the individual properties of the features and how 
they interact with each other. Descriptive statistics and distribution of features provide us 
with insight into what they are and how they behave in our dataset. Our previous chapters 
have addressed this first question. 

The next step in machine learning involves asking questions and choosing the right 
set of features (or variables) and the criteria to choose them. These questions cannot 
be answered by studying the individual properties of the features alone; we also need to 
understand their interactions with each other and with the response variable. That is why we 
have to answer the second question and quantify these relations to get a set of 
features that is best for the machine learning algorithm. 

Learning objectives: 


• Introduction to feature engineering 
• Feature ranking 
• Variable subset selection 
• Dimensionality reduction 


The chapter discusses some hands-on examples to apply the general statistical method 
to these concepts within the feature engineering space. The later part of the chapter will 
discuss some examples to show how business-critical thinking helps feature selection. 

The illustrations in this chapter are based on loan default data. The data contain the 
loss on a given loan. The loss on each loan is graded between 1 and 100. For the cases 
where the full loan was recovered, the value of loan loss is set to 0, which means there was 
no default on that loan. A loss of 60 means that only 40% of the loan was recovered. The 
data is set up to create a default prediction model. 

The data feature names are anonymized to bring focus on the statistical 
quantification of relationships among features. There are some key terms associated with 
loan default in the financial services industry: Probability of Default (PD), Exposure at 
Default (EAD), and Loss Given Default (LGD). While the focus of this chapter is to show 
how statistical methods work, you are encouraged to draw parallels to your own 
business problems, for which loan default can serve as a good reference point. 


5.1 Introduction to Feature Engineering 


Feature engineering has become a core process in developing any data solution. The 
emergence of feature engineering as an integral part of the machine learning solution 
development is mainly driven by two factors: 


• Increase in the number of features/variables 
• Time and complexity of machine learning algorithms 


With technological advances, it’s now possible to collect a lot of data at just a fraction 
of the cost. In many cases, to improve modeling output, we merge lots of data from 
third-party and external open sources into internal data. This creates huge sets of 
features for machine learning algorithms. All the features in our consideration set might 
not be important from a machine learning perspective and, even if they are, all of them 
might not be needed to attain a level of confidence in model predictions. 

The other aspect is time and complexity; the machine learning algorithms are 
resource intensive and time increases exponentially for each feature added to the model. 
A data scientist has to bring in a balance between this complexity and benefit in the final 
model accuracy. 

To completely understand the feature engineering concepts, we have to decouple 
this terminology into two separate but supporting processes: 


• Feature selection (or variable selection) 
• Business/domain knowledge 


The former is statistics-intensive and provides empirical evidence as to why a certain 
feature or set of features is important for the machine learning algorithm. This is based on 
quantifiable and comparable metrics created either independently of the response variable 
or otherwise. The latter applies business logic to make sure the features make 
sense and provide the insights the business is looking for. 
In many cases, business logic takes precedence over statistical results. This 
precedence is not a hard and fast rule but business insights are not always driven 
by sound statistical results. When there is a conflict, business requirements take 
precedence over statistical inferences. For instance, suppose the unemployment rate 
is used for identifying loan defaults in a region. For the set of data, it might be possible 
that unemployment rate might not be significant at the 95% confidence level, but is 
significant at the 90% confidence level. If a business believes that unemployment rate is 
an important variable, then we might want to create an exception in the variable selection 
where the unemployment rate is captured with relaxed statistical constraints. 

Business/domain knowledge varies with industry and application. Business needs 
are evolving and are very difficult to capture in a time-bound manner. We will discuss 
an example from the financial services domain to explain how variable selection and 
domain knowledge come together in deciding which features to use in the model. The 
main focus of the chapter is on statistical aspects of feature engineering, which we discuss 
under the sections of variable selection and feature creation. 

The main benefits that come out of a robust and structured variable selection are: 


• Improved predictive performance of the model 
• Faster and less complex machine learning process 
• Better understanding of underlying data relationships 
• Explainable and implementable machine learning models/solutions 


The first three benefits are intuitive and can be relayed back to our prior discussion. 
Let's invest some time to give due importance to the fourth point here. Business insights 
are generally driven from simple and explainable models. The more complicated a 
machine is, the more difficult it is to explain. Try to think about features as business 
action points. If the machine being built has features that cannot be explained in clear 
terms back to the business, the business loses value because the model output doesn’t map 
back to actionable points. That means the whole purpose of machine learning is lost. 

Any model that you develop has to be deployed in the live environment for use by 
the end users. For a live environment, each added feature in the model means an added 
data feed into the live system, which in turn may mean accessing a whole new database. 
This creates a lot of IT system changes and dependencies within the system. The 
implementation and maintenance costs then have to be weighed against the benefits 
of the model and the necessity of keeping so many variables. If the same underlying 
behavior can be explained with fewer features, implementation should be done with 
fewer features. Agility to compute and provide quick results often outweighs a better 
model with more features. 

The feature selection methods are broadly divided into three groups—filter, wrapper, 
and embedded. 
5.1.1 Filter Methods 


Filter methods select variables regardless of the model. They put the features in an 
ordered list using general properties, like their correlation with the variable to predict 
or their variance. The ranked list then supports the decision to keep or remove features 
based on rank. Filter methods are often univariate and consider each feature independently 
of the others. The scoring can be based on the feature alone or on its relationship with 
the dependent variable. 

Some of the best known filter techniques include the chi-square test, correlation 
coefficients, and information gain metrics. For example, we know that high variance 
in the data normally reflects more information in it. In filter methods, we can filter out 
the features that have low variance and keep the ones with high variance for further 
analysis. 
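As a minimal sketch of a filter method, the following ranks features by the absolute value of their correlation with the target, without fitting any model. The data frame `toy` and its columns `x1`, `x2`, `x3`, and `y` are hypothetical and simulated here only for illustration; they are not part of the loan default data.

```r
# Hypothetical toy data: x1 is strongly related to y, x2 weakly, x3 not at all
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 * x1 + 0.5 * x2 + rnorm(n)
toy <- data.frame(x1, x2, x3, y)

# Score each feature independently of any model: absolute correlation with y
features <- setdiff(names(toy), "y")
scores <- sapply(features, function(f) abs(cor(toy[[f]], toy$y)))

# Rank features from highest to lowest score
ranked <- sort(scores, decreasing = TRUE)
print(ranked)
```

A cutoff on the score, or a fixed number of top-ranked features, would then decide which features are kept; the choice of cutoff is a judgment call.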


5.1.2 Wrapper Methods 


Wrapper methods consider a set of features to find the best subset of features for a 
modeling problem. This method treats the features selection process as a search problem, 
where different combinations of features are tested against performance criteria and 
compared with other combinations. A predictive model is used to evaluate the different 
sets of features and an accuracy metric is used to score the set of features. The set of 
features with the highest accuracy measure is chosen for modeling. 

The search process may use heuristics like forward selection, backward selection, 
and so on, or be probabilistic, such as the random hill-climbing algorithm. Or it may be 
methodical, like best-first search or full brute-force search. Another advanced example 
of a wrapper method is the recursive feature elimination algorithm. A simple example can 
be constructed around forward selection of variable subset; the model starts with a single 
variable and then starts adding more variables by measuring how much improvement 
the new variable brings into the model. When addition of a variable doesn’t bring any 
improvement in the model, we stop. This way, we can search model subset space to find 
the best subset. 
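The forward selection idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the data frame `toy` is simulated, and AIC is assumed as the performance criterion (any other metric, such as accuracy on held-out data, could be substituted).

```r
# Hypothetical toy data for a binary outcome; only x1 and x2 drive y
set.seed(1)
n <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.0 * toy$x2))

candidates <- setdiff(names(toy), "y")
selected   <- character()
# Start from the intercept-only model
best_aic   <- AIC(glm(y ~ 1, data = toy, family = binomial))

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  # AIC of the current model extended by each remaining feature in turn
  aics <- sapply(remaining, function(f) {
    AIC(glm(reformulate(c(selected, f), response = "y"),
            data = toy, family = binomial))
  })
  # Stop when no candidate improves the criterion
  if (min(aics) >= best_aic) break
  best_aic <- min(aics)
  selected <- c(selected, names(which.min(aics)))
}
print(selected)
```

Because each step refits the model once per remaining feature, the cost of a wrapper search grows quickly with the number of candidate features.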


5.1.3 Embedded Methods 


Embedded methods are improved versions of wrapper algorithms. They introduce a 
penalty factor to the evaluation criteria of the model to bias the model toward lower 
complexity. The algorithms try to balance between the complexity and accuracy of the 
model. Regularization is the most common embedded method for variable subset 
selection, e.g., L1 and L2 regularization (LASSO and ridge regression, respectively). LASSO 
stands for least absolute shrinkage and selection operator; it will be discussed later 
in this chapter. 
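To see how a penalty factor biases a model toward lower complexity, the sketch below fits a ridge (L2-penalized) linear regression in closed form for increasing penalty values. The simulated data and the helper `ridge_coef` are assumptions made for illustration; `ridge_coef` is not a library function.

```r
# Hypothetical data: three standardized features, the third one irrelevant
set.seed(7)
n <- 200
X <- scale(matrix(rnorm(n * 3), ncol = 3))
y <- as.vector(X %*% c(3, 0.5, 0) + rnorm(n))
y <- y - mean(y)   # center the response so no intercept is needed

# Illustrative helper: closed-form ridge solution (X'X + lambda * I)^-1 X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  as.vector(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

lambdas <- c(0, 10, 100, 1000)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)
print(round(coefs, 3))

# The overall coefficient size shrinks as the penalty grows
coef_norm <- sqrt(colSums(coefs^2))
print(coef_norm)
```

Note that the L2 penalty only shrinks coefficients toward zero, whereas an L1 penalty such as LASSO can set some of them exactly to zero, which is what makes it usable for selecting a subset.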
5.2 Understanding the Working Data 


The data used in this chapter is credit risk data from a public competition. Credit risk 
modeling is one of the most involved modeling problems in the banking industry. The 
process of building a credit risk model is not only complicated in terms of data but also 
requires in-depth knowledge of business and market dynamics. 


A credit risk is the risk of default on a debt that may arise from a borrower 
failing to make required payments. 


A little more background on key terms from credit risk modeling will be helpful for 
you to relate these data problems to other similar domain problems. We briefly introduce 
a few key concepts in credit risk modeling. 


Willingness to pay and ability to pay: The credit risk model tries 
to quantify these two aspects of any borrower. Ability to pay 
can be quantified by studying the financial conditions of the 
borrower (variables like income, wealth, etc.), while the tough part 
is measuring willingness to pay, where we use variables that 
capture behavioral properties (variables like default history, 
fraudulent activities, etc.). 


Probability of default (PD): PD is a measure that indicates how 
likely the borrower is going to default in the next period. The 
higher the value, the higher the chances of default. It is a measure 
having value between 0 and 1 (boundary inclusive). Banks want 
to lend money to borrowers having a low PD. 


Loss Given Default (LGD): LGD is a measure of how much the 
lender is likely to lose if the borrower defaults in the next period. 
Generally, lenders hold some kind of collateral to limit the 
downside risk of default. In simple terms, this measure is 
the amount lent minus the value of the collateral. It is usually 
measured as a percentage. Borrowers having high LGDs are a risk. 


Exposure at Default (EAD): EAD is the amount that the bank/lender 
is exposed to at the current point in time. This is the amount 
that the lender is likely to lose if the borrower defaults right now. 
This is one of the most closely watched metrics in any bank credit 
risk division. 


These terms will help you think through how we can extract different information 
from the same data by tweaking the way we do feature engineering. All these metrics can be 
predicted from the same loan default data, but the way we go about selecting features 
will differ. 
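The simplified LGD description above (amount lent minus collateral, expressed as a percentage) can be worked through on a few made-up loans. The data frame `loans` and all of its values are hypothetical, and flooring the loss at zero is an assumption added here for fully collateralized loans.

```r
# Hypothetical loans: amount lent and value of the collateral held
loans <- data.frame(
  loan_id    = c("A", "B", "C"),
  amount     = c(100000, 50000, 20000),
  collateral = c(80000, 50000, 5000)
)

# Loss if the borrower defaults: amount lent minus collateral (never below zero)
loans$loss_amount <- pmax(loans$amount - loans$collateral, 0)

# LGD expressed as a percentage of the amount lent
loans$lgd_pct <- 100 * loans$loss_amount / loans$amount
print(loans)
```

Loan C, with little collateral relative to the amount lent, carries the highest LGD of the three.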
5.2.1 Data Summary 


A data summary of the input data provides vital information about the data. For this 
chapter, we need to understand some of the features of the data before we apply different 
techniques to show how statistical methods select the feature set for modeling. 
The important features that we will be looking at are as follows: 


• Properties of dependent variable 
• Feature availability: continuous or categorical 
• Setting up data assumptions 


5.2.2 Properties of Dependent Variable 


In our dataset, loss is the variable used as the dependent variable in the figures in this 
chapter. The modeling is to be done for credit loss. Loan is a type of credit and we will use 
credit loss and loan loss interchangeably. The loss variable has values between 0 and 
100. We will see the loss variable’s distribution in this chapter. 

The following code loads the data and shows the dimension of the dataset created. 
Dimension means the number of records by the number of features (rows by columns). 


#Input the data and store in data table 
library(data.table) 

data <- fread("Dataset/Loan Default Prediction.csv", header = TRUE, verbose = FALSE, 
              showProgress = TRUE) 
Read 105471 rows and 771 (of 771) columns from 0.476 GB file in 00:01:02 

dim(data) 
[1] 105471    771 


There are 105,471 records with 771 attributes. Out of 771, there is one dependent series 
and one primary key. We have 769 features to create a feature set for this credit loss model. 

We know that the dependent variable is loss on a scale of 0 to 100. For analysis 
purposes, we will analyze the dependent variable as continuous and discrete. As a 
continuous variable, we will look at descriptive statistics and, as a discrete variable, we 
will look at the distribution. 


#Summary of the data 
summary(data$loss) 
Min. 1st Qu. Median Mean 3rd Qu. Max. 
0.0000 0.0000 0.0000 0.7996 0.0000 100.0000 
The highest loss that is recorded is 100, which is equivalent to saying that all the 
outstanding credit on the loan was lost. The mean is close to 0 and the first and third 
quartiles are 0. Clearly the loss cannot be treated as a continuous variable, as most of the 
values are concentrated at 0. In other words, the number of cases with default is low. 


hist(data$loss, 
     main = "Histogram for Loss Distribution ", 
     xlab = "Loss", 
     border = "blue", 
     col = "red", 
     las = 0, 
     breaks = 100, 
     prob = TRUE) 
Figure 5-1. Distribution of loss (including no default) 


The distribution of loss in Figure 5-1 shows that loss is equal to zero for most part 
of the distribution. We can see that using loss as a continuous variable is not possible 
in this setting. So we will convert our dependent variable into a dichotomous variable, 
with 0 representing a non-default and 1 a default. The problem thus reduces to default 
prediction, and we now know what kind of machine learning algorithm we intend to 
use down the line. This prior information will help us choose the appropriate feature 
selection methods and metrics to use in feature selection. 

Let's now see for the cases where there is default (i.e., loss not equal to zero), how the 
loss is distributed (recall the LGD measure). 
#Subset the data to the cases with a registered loss (i.e., loss != 0) 
subset_loss <- subset(data, loss != 0) 

#Distribution of cases where there is some loss registered 
hist(subset_loss$loss, 
     main = "Histogram for Loss Distribution ( Only Default cases) ", 
     xlab = "Loss", 
     border = "blue", 
     col = "red", 
     las = 0, 
     breaks = 100, 
     prob = TRUE) 


The following distribution plot excludes non-default cases; in other words, it covers 
only the cases where loss > 0. 

Figure 5-2. Distribution of loss (excluding no default) 
In more than 90% of the default cases, we have a loss below 25%; hence the Loss Given 
Default (LGD) is low (see Figure 5-2). The company can recover a high share of the amount due. 


For further discussion around feature selection, we will create a dichotomous variable 
called default, which will be 0 if the loss is equal to 0 and 1 otherwise. 


default = 0: there is no default and hence no loss 
default = 1: there is a default 

#Create the default variable 
data[, default := ifelse(loss == 0, 0, 1)] 

#Distribution of defaults 


table(data$default) 
0 1 
95688 9783 


#Event rate is defined as ratio of default cases in total population 
print(table(data$default)*100/nrow(data)) 


0 1 
90.724465 9.275535 


So we have converted our dependent variable into a dichotomous variable and 
our features selection problem will be geared toward finding the best set of features to 
model this default behavior for our data. The distribution table states that we have 9.3% 
of the cases of default in our dataset. This is sometimes called an event rate in the model 
development data. 
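Because the target is coded as 0/1, the event rate is simply the mean of the indicator. The following sketch shows this on a small hypothetical vector `loss`; the values are made up and do not come from the loan default data.

```r
# Hypothetical loss values: 0 means the loan was fully recovered
loss <- c(0, 0, 12, 0, 0, 0, 100, 0, 0, 3)

# Dichotomous target: 1 if any loss was registered, 0 otherwise
default <- as.integer(loss != 0)

# Event rate: share of default cases in the total population
event_rate <- mean(default)
print(event_rate)

# The same information as a distribution of the two classes
print(prop.table(table(default)))
```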


5.2.3 Features Availability: Continuous or Categorical 


The data has 769 features to create a model for the credit loss. We have to identify how 
many of these features are continuous and categorical. This will allow us to design the 
feature selection process appropriately, as many metrics are not directly comparable 
for ordering, e.g., correlation of the continuous variable is different than the correlation 
measure for categorical variables. 


Tip If you don't have any prior knowledge of a feature’s valid values, you can treat 
variables with more than 30 levels as continuous and ones with 30 or fewer levels as 
categorical variables. 


The following code snippet does three things to identify the type of treatment a 
variable needs to be given, i.e., continuous or categorical: 


• Remove the id, loss, and default indicators from this analysis, as 
these variables are identifiers or dependent variables. 

• Find the unique values in each feature; if the number of 
unique values is less than or equal to 30, assign that feature to 
the categorical set. 

• If the number of unique values is greater than 30, assign it to 
the continuous set. 




This idea works for us; however, you have to be cautious about variables like ZIP 
code (it is a nominal variable), states (the number of states can be more than 30 and they 
are characters), and other features having character values. 


continuous <- character() 
categorical <- character() 

#Write a loop to go over all features and find unique values 
p <- 1 
q <- 1 
for (i in names(data)) 
{ 
  unique_levels = length(unique(data[, get(i)])) 
  if (i %in% c("id", "loss", "default")) 
  { 
    next; 
  } 
  else 
  { 
    if (unique_levels <= 30 | is.character(data[, get(i)])) 
    { 
      # cat("The feature ", i, " is a categorical variable") 
      categorical[p] <- i 
      p = p + 1 
      # Making the categorical feature a factor 
      data[[i]] <- factor(data[[i]]) 
    } 
    else 
    { 
      # cat("The feature ", i, " is a continuous variable") 
      continuous[q] <- i 
      q = q + 1 
    } 
  } 
} 


# subtract 1 as one is dependent variable = default 
cat("\nTotal number of continuous variables in feature set ", 
length(continuous) -1) 


Total number of continuous variables in feature set 717 

# subtract 2 as one is loss and one is id 

cat("\nTotal number of categorical variable in feature set ", 
length(categorical) -2) 


Total number of categorical variable in feature set 49 
These iterations have divided the data into categorical and continuous variables, with 
49 and 717 features, respectively. We will ignore the domain-specific 
meaning of these features, as our focus is on the statistical aspects of feature selection. 
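The same 30-level rule can also be written without an explicit loop. The sketch below is one possible vectorized variant, shown on a hypothetical data frame `toy` rather than on the loan default data; the column names are made up.

```r
# Hypothetical toy data standing in for the feature set
set.seed(3)
toy <- data.frame(
  amount = rnorm(200),                                        # many distinct values
  grade  = sample(1:5, 200, replace = TRUE),                  # few distinct values
  state  = sample(c("NY", "CA", "TX"), 200, replace = TRUE),  # character values
  stringsAsFactors = FALSE
)

# Number of distinct values per feature, and which features hold characters
n_levels <- sapply(toy, function(col) length(unique(col)))
is_char  <- sapply(toy, is.character)

# At most 30 levels, or character values, means categorical
categorical <- names(toy)[n_levels <= 30 | is_char]
continuous  <- setdiff(names(toy), categorical)
print(categorical)
print(continuous)
```

On wide tables, computing the level counts once with `sapply()` also makes it easy to inspect borderline features before committing to a cutoff.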


5.2.4 Setting Up Data Assumptions 
To explain the different aspects of feature selection, we will be using some assumptions: 


• We do not have any prior knowledge of feature importance or 
domain-specific restrictions. 

• The machine/model we want to create will predict the 
dichotomous variable default. 

• The order of steps is just for illustration; multiple variations 
do exist. 


5.3 Feature Ranking 


Feature ranking is one of the most popular methods of identifying the explanatory power 
of a feature against the set purpose of the model. In our case the purpose is to predict a 0 
or 1. The explanatory power has to be captured in a predefined metric, so we can put the 
features in an ordinal manner. 

In our problem setup, we can use following steps to get feature rankings: 


• For each feature, fit a logistic model (a more elaborate 
treatment of this topic is covered in Chapter 6) with default as the 
dependent variable. 

• Calculate the Gini coefficient. Here, the Gini coefficient is the 
metric we defined to measure the explanatory power of the 
feature. 

• Rank order the features using the Gini coefficient, where a higher 
Gini coefficient means greater explanatory power of the feature. 


Package "MLmetrics" 


This is a collection of evaluation metrics, including loss, score, and utility functions, 
that measure regression, classification, and ranking performance. This is a useful package 
for calculating classifier performance metrics. We will be using the function Gini() in 
this package to get the Gini coefficient. 
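For intuition about what such a coefficient measures, one common definition for binary classifiers relates the Gini coefficient to the area under the ROC curve (AUC) as Gini = 2 × AUC − 1. The helper `gini_from_auc` below is a hypothetical base R sketch of that definition; it is not the MLmetrics implementation, and the two may differ in details such as the handling of tied scores.

```r
# Illustrative helper: Gini = 2 * AUC - 1, with AUC from the rank-sum statistic
gini_from_auc <- function(score, label) {
  r  <- rank(score)                 # average ranks are used for tied scores
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  auc <- (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  2 * auc - 1
}

# Hypothetical labels with a perfect ranking and with a constant score
label <- c(0, 0, 0, 1, 1)
gini_perfect  <- gini_from_auc(c(0.1, 0.2, 0.3, 0.8, 0.9), label)
gini_constant <- gini_from_auc(rep(0.5, 5), label)
print(gini_perfect)   # every default scored above every non-default
print(gini_constant)  # the score carries no ranking information
```

Under this definition, a value near 1 means the score separates defaults from non-defaults almost perfectly, and a value near 0 means it ranks them no better than chance.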

The following code snippet implements the following steps: 


1. For each feature in data, fit a logistic regression using the 
logit link function. 


2. Calculate the Gini coefficient on all the data (you can also 
train on training data and calculate Gini on testing data). 


3. Order all the features by the Gini coefficient (higher to lower). 
library(MLmetrics) 
performance_metric_gini <- data.frame(feature = character(), Gini_value = numeric()) 

#Write a loop to go over all features and fit a single-feature logistic model 
for (feature in names(data)) 
{ 
  if (feature %in% c("id", "loss", "default")) 
  { 
    next; 
  } 
  else 
  { 
    tryCatch({ 
      glm_model <- glm(default ~ get(feature), data = data, 
                       family = binomial(link = "logit")); 
      predicted_values <- predict.glm(glm_model, newdata = data, type = "response"); 
      Gini_value <- Gini(predicted_values, data$default); 
      performance_metric_gini <- rbind(performance_metric_gini, 
                                       cbind(feature, Gini_value)); 
    }, error = function(e){}) 
  } 
} 


performance_metric_gini$Gini_value <- 
  as.numeric(as.character(performance_metric_gini$Gini_value)) 

#Rank the features by value of Gini Coefficient 
Ranked_Features <- performance_metric_gini[order(-performance_metric_gini$Gini_value), ] 

print("Top 5 Features by Gini Coefficients\n") 
[1] "Top 5 Features by Gini Coefficients\n" 
head(Ranked_Features) 
    feature Gini_value 
710    f766  0.2689079 
389    f404  0.2688113 
584    f629  0.2521622 
585    f630  0.2506394 
269    f281  0.2503371 
310    f322  0.2447725 


Tip When you are running loops over large datasets, it is possible that the loop might 
stop due to some errors. To avoid that, consider using the tryCatch() function in R. 
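A minimal sketch of this pattern follows. The list `inputs` is hypothetical and deliberately contains one element that makes the computation fail, so that the effect of `tryCatch()` inside a loop is visible.

```r
# Hypothetical inputs: one of them makes the computation fail
inputs  <- list(a = 4, b = "not a number", c = 9)
results <- numeric()

for (name in names(inputs)) {
  tryCatch({
    # sqrt() raises an error for the character input
    results[name] <- sqrt(inputs[[name]])
  }, error = function(e) {
    # Report the problem and let the loop move on to the next item
    message("Skipping ", name, ": ", conditionMessage(e))
  })
}
print(results)
```

Logging the error message, rather than using an empty handler, makes it easier to find out afterward which items were skipped and why.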
The ranking method tells us that the top six features by their individual predictive 
power are f766, f404, f629, f630, f281, and f322. The Gini coefficient of the top feature is 
0.269 (or 26.9%). Now, using the set of top six features, let’s create a logistic model and 
see the same performance metric. 

The following code uses the top six features to fit a logistic model on our data. After 
fitting the model, it then prints out the Gini coefficient of the model. 


#Create a logistic model with top 6 features (f766, f404, f629, f630, f281 and f322) 

glm_model <- glm(default ~ f766 + f404 + f629 + f630 + f281 + f322, data = data, 
                 family = binomial(link = "logit")); 

predicted_values <- predict.glm(glm_model, newdata = data, type = "response"); 
Gini_value <- Gini(predicted_values, data$default); 
summary(glm_model) 
Call: 
glm(formula = default ~ f766 + f404 + f629 + f630 + f281 + f322, 
    family = binomial(link = "logit"), data = data) 

Deviance Residuals: 
    Min       1Q   Median       3Q      Max 
-0.7056  -0.4932  -0.4065  -0.3242   3.3407 

Coefficients: 
             Estimate Std. Error z value Pr(>|z|) 
(Intercept) -3.071639   2.160885  -1.421    0.155 
f766        -1.609598   2.150991  -0.748    0.454 
f404         0.351095   2.147072   0.164    0.870 
f629        -0.505835   0.077767  -6.505 7.79e-11 *** 
f630        -0.090988   0.057619  -1.579    0.114 
f281        -0.004073   0.008245  -0.494    0.621 
f322         0.262128   0.055992   4.682 2.85e-06 *** 
--- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1) 

    Null deviance: 65044  on 105147  degrees of freedom 
Residual deviance: 62855  on 105141  degrees of freedom 
  (323 observations deleted due to missingness) 
AIC: 62869 

Number of Fisher Scoring iterations: 6 

Gini_value 
[1] 0.2824955 


The model result shows that four features (f766, f404, f630, and f281) are not 
significant. The standard errors are very high for some of these features (f766 and f404 
in particular). This gives us an indication that the features themselves are highly 
correlated and hence are not adding value by being in the model. As you can see, the Gini 
coefficient has improved only marginally (from 0.269 to 0.282), even after adding more 
variables. The reason for the top features being insignificant could be that all of them 
are highly correlated. To investigate this multicollinearity issue, we will 
create the correlation matrix for the six features. 


#Create the correlation matrix for 6 features (f766, f404, f629, f630, f281 and f322) 

top_6_feature <- data.frame(data$f766, data$f404, data$f629, data$f630, data$f281, 
                            data$f322) 

cor(top_6_feature, use = "complete") 
           data.f766  data.f404  data.f629  data.f630  data.f281  data.f322 
data.f766  1.0000000  0.9996710  0.6830923  0.6420238  0.8067094 -0.7675846 
data.f404  0.9996710  1.0000000  0.6827368  0.6416069  0.8065005 -0.7675819 
data.f629  0.6830923  0.6827368  1.0000000  0.9114775  0.6515478 -0.5536863 
data.f630  0.6420238  0.6416069  0.9114775  1.0000000  0.6102867 -0.5127184 
data.f281  0.8067094  0.8065005  0.6515478  0.6102867  1.0000000 -0.7280321 
data.f322 -0.7675846 -0.7675819 -0.5536863 -0.5127184 -0.7280321  1.0000000 


It’s clear from the correlation structure that the features f766, f404, f630, and f281 
are highly correlated and hence the model results show them to be insignificant. 
This exercise shows that while feature ranking helps in measuring and quantifying 
the individual power of variables, it might not be directly used as a method of variable 
selection for model development. 
Guyon and Elisseeff provide the following criticism for this variable ranking method: 


[The] variable ranking method leads to the selection of a redundant 
subset. The same performance could possibly be achieved with a smaller 
subset of complementary variables. 


You can verify this fact by looking at the correlation matrix and the significant 
variables in the model. The two significant variables are complementary and provide a 
similar Gini coefficient. 
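When the number of ranked features is larger than six, reading a correlation matrix by eye becomes impractical. The following sketch lists the feature pairs whose absolute correlation exceeds a cutoff. The data frame `toy`, the cutoff of 0.9, and the name `redundant_pairs` are assumptions chosen for illustration.

```r
# Hypothetical toy data: x2 is almost a copy of x1, x3 is unrelated
set.seed(11)
x1 <- rnorm(300)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(300, sd = 0.05),
                  x3 = rnorm(300))

cor_mat <- cor(toy, use = "complete.obs")

# Keep each pair once (upper triangle) and flag those above an assumed cutoff
cutoff <- 0.9
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
redundant_pairs <- data.frame(
  feature_1   = rownames(cor_mat)[high[, "row"]],
  feature_2   = colnames(cor_mat)[high[, "col"]],
  correlation = cor_mat[high]
)
print(redundant_pairs)
```

Each flagged pair is a candidate for keeping only one of the two features before fitting a multi-feature model.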
5.4 Variable Subset Selection 


Variable subset selection is the process of selecting a subset of features (or variables) 
to use in the machine learning model. In the previous section, we tried to create a subset 
of variables using the individual ranking of variables but observed the limitations of 
feature ranking as a variable selection method. Now we formally introduce the process 
of variable subset selection. We will be discussing one method from each broad category 
and will show an example using the credit loss data. You are encouraged to compare the 
results and assess what method suits your machine learning problem best. 

Isabelle Guyon and Andre Elisseeff provided a comprehensive introduction to various 
methods of variable (or feature) selection. They describe the criteria for different 
methods as measuring the "usefulness" or "relevance" of features to qualify them to be part 
of the variable subset. The three broad methods (filter, wrapper, and embedded) are 
illustrated with our credit loss data. 


5.4.1 Filter Method 


The filter method uses the intrinsic properties of variables, ignoring the machine learning 
method itself. This method is useful for classification problems where each variable adds 
incremental classification power. 


Criterion: Measure feature/feature subset "relevance" 
Search: Order features by individual feature ranking or nested subset of features 
Assessment: Using statistical tests 

Statistical Approaches 

1. Information gain 
2. Chi-square test 
3. Fisher score 
4. Correlation coefficient 
5. Variance threshold 

Results 

1. Relatively more robust against overfitting 
2. Might not select the most "useful" set of features 

For this method we will be showing the variance threshold approach, which 
is based on the basic concept that the variables that have high variability also have 
higher information in them. Variance threshold is a simple baseline approach. In this 
method, we remove all the variables having variance less than a threshold. This method 
automatically removes the variables having zero variance. 
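A variance threshold can be sketched as follows. The data frame `toy` is hypothetical, its columns are assumed to be on comparable scales, and the threshold of 0.1 is an arbitrary value chosen for illustration.

```r
# Hypothetical toy data on a comparable scale; "constant" has zero variance
set.seed(5)
toy <- data.frame(
  wide     = rnorm(100, sd = 2),
  narrow   = rnorm(100, sd = 0.01),
  constant = rep(1, 100)
)

# Variance of every feature
variances <- sapply(toy, var)
print(variances)

# Keep only the features whose variance exceeds an assumed threshold
threshold <- 0.1
kept <- names(variances)[variances > threshold]
print(kept)
```

The zero-variance column is dropped for any positive threshold, which is why this approach is often used as a first, inexpensive cleaning step.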
Note The features in our dataset are not standardized and hence we cannot do a direct 
comparison of variances. We will be using the coefficient of variation (CV) to choose the top 
features for model building. Also, the following exercise is shown only for continuous 
features; for categorical variables, use a chi-square test. 


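The coefficient of variance (CoV), also known as the coefficient of variation or relative 
standard deviation, provides a standardized measure of dispersion of a variable. It is 
defined as the ratio of the standard deviation to the mean of the variable: 

$$\mathrm{CoV} = \frac{\sigma}{\mu}$$ 

where $\sigma$ is the standard deviation and $\mu$ is the mean of the variable. 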
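The reason for preferring this ratio over the raw variance for unstandardized features can be seen in a short sketch: changing the unit of measurement changes the variance but leaves the ratio untouched. The vectors and the helper `coef_of_variation` are hypothetical and used only for illustration.

```r
# Hypothetical amounts, recorded once in dollars and once in thousands of dollars
set.seed(9)
amount_usd  <- runif(1000, min = 1000, max = 5000)
amount_kusd <- amount_usd / 1000

# Illustrative helper: ratio of standard deviation to mean (in absolute value)
coef_of_variation <- function(x) abs(sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))

# The variance changes with the unit of measurement ...
print(c(var(amount_usd), var(amount_kusd)))
# ... while the coefficient of variation does not
print(c(coef_of_variation(amount_usd), coef_of_variation(amount_kusd)))
```

One caveat is that the ratio becomes unstable for features whose mean is close to zero, so very large values should be inspected rather than taken at face value.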


Here, we calculate the mean and standard deviation of each continuous variable, then we 
take the ratio of the standard deviation to the mean to calculate the Coefficient of 
Variance (CoV). The features are then ordered by decreasing coefficient of variance. 


#Calculate the standard deviation of each individual variable and standardize it 
#by dividing with mean() 

coefficient_of_variance <- data.frame(feature = character(), cov = numeric()) 

#Write a loop to go over all continuous features and calculate the CoV 
for (feature in names(data)) 
{ 
  if (feature %in% c("id", "loss", "default")) 
  { 
    next; 
  } 
  else if (feature %in% continuous) 
  { 
    tryCatch({ 
      cov <- abs(sd(data[[feature]], na.rm = TRUE)/mean(data[[feature]], na.rm = TRUE)); 
      if (cov != Inf) { 
        coefficient_of_variance <- rbind(coefficient_of_variance, cbind(feature, cov)); 
      } else {next;} 
    }, error = function(e){}) 
  } 
  else 
  { 
    next; 
  } 
} 
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coefficient of variance$cov <-as.numeric(as.character(coefficient of_ 
variance$cov) ) 


#Order the list by highest to lowest coefficient of variation 


Ranked Features cov <-coefficient of variance|order(-coefficient_of_ 
variance$cov), | 


print("Top 5 Features by Coefficient of Variance\n") 

[1] "Top 5 Features by Coefficient of Variance\n" 
head(Ranked Features cov) 

feature COV 

295 338 164.46714 

378 £422 140.48973 

667 #724 87.22657 

584 636 78.06823 

715 #775 70.24765 

666 723 46.31984 


The coefficient of variance provided the top six features by order of their CoV values.
The features that show up in the top six (f338, f422, f724, f636, f775, and f723) are then
used to fit a binomial logistic model. We calculate the Gini coefficient of the model to
assess if these variables improve the Gini over individual features, as discussed earlier.


#Create a logistic model with top 6 features (f338, f422, f724, f636, f775 and f723)

glm_model <- glm(default ~ f338 + f422 + f724 + f636 + f775 + f723, data = data,
                 family = binomial(link = "logit"))

predicted_values <- predict.glm(glm_model, newdata = data, type = "response")
Gini_value <- Gini(predicted_values, data$default)
summary(glm_model)
Call:
glm(formula = default ~ f338 + f422 + f724 + f636 + f775 + f723,
    family = binomial(link = "logit"), data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0958  -0.4839  -0.4477  -0.4254   2.6363

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)
(Intercept) -2.206e+00  1.123e-02 -196.426  < 2e-16 ***
f338        -1.236e-25  2.591e-25   -0.477    0.633
f422         1.535e-01  1.373e-02   11.183  < 2e-16 ***
f724         1.392e+01  9.763e+00    1.426    0.154
f636        -1.198e-06  2.198e-06   -0.545    0.586
f775         6.412e-02  1.234e-02    5.197 2.03e-07 ***
f723        -5.181e+00  4.623e+00   -1.121    0.262
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 59064  on 90687  degrees of freedom
Residual deviance: 58898  on 90681  degrees of freedom
  (14783 observations deleted due to missingness)
AIC: 58912

Number of Fisher Scoring iterations: 6

cat("The Gini Coefficient for the fitted model is ", Gini_value)
The Gini Coefficient for the fitted model is  0.1445109


This method does not show any improvement in the number of significant
variables among the top six, i.e., only two features are significant: f422 and f775. Also,
the model's overall performance is worse, i.e., the Gini coefficient is 0.144 (14.4% only).
For completeness, let's create the correlation matrix for these six
features. We want to see if the variables are correlated and hence are insignificant.


#Create the correlation matrix for 6 features (f338, f422, f724, f636, f775 and f723)

top_6_feature <- data.frame(as.double(data$f338), as.double(data$f422),
                            as.double(data$f724), as.double(data$f636),
                            as.double(data$f775), as.double(data$f723))

cor(top_6_feature, use = "complete")
                     as.double.data.f338. as.double.data.f422.
as.double.data.f338.         1.000000e+00          0.009542857
as.double.data.f422.         9.542857e-03          1.000000000
as.double.data.f724.         4.335480e-02          0.006249059
as.double.data.f636.        -6.708839e-05          0.011116608
as.double.data.f775.         5.537591e-03          0.050666549
as.double.data.f723.         5.048078e-02          0.005556227
                     as.double.data.f724. as.double.data.f636.
as.double.data.f338.         0.0433548003        -6.708839e-05
as.double.data.f422.         0.0062490589         1.111661e-02
as.double.data.f724.         1.0000000000        -1.227539e-04
as.double.data.f636.        -0.0001227539         1.000000e+00
as.double.data.f775.         0.0121451180        -7.070228e-03
as.double.data.f723.         0.9738147134        -2.157437e-04
                     as.double.data.f775. as.double.data.f723.
as.double.data.f338.          0.005537591         0.0504807821
as.double.data.f422.          0.050666549         0.0055562270
as.double.data.f724.          0.012145118         0.9738147134
as.double.data.f636.         -0.007070228        -0.0002157437
as.double.data.f775.          1.000000000         0.0190753853
as.double.data.f723.          0.019075385         1.0000000000


Apart from f724 and f723, which are strongly correlated with each other (about 0.97),
the pairwise correlations are close to zero. The correlation structure is therefore not
dominating the feature set; rather, individual feature relevance is driving selection into
the modeling subset. This is expected, because we selected the variables based on CoV,
which is computed for each variable independently of the others.
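When a correlation matrix has more than a handful of columns, scanning it by eye becomes error-prone. The sketch below shows one way to list the strongly correlated pairs programmatically. The toy data frame and the 0.9 cutoff are assumptions chosen for illustration and are not values taken from this chapter.

```r
set.seed(1)
# Hypothetical toy features: b is almost a copy of a, c is independent
a   <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

cor_mat <- cor(toy, use = "complete.obs")

# Look only at the upper triangle so each pair is reported once
cutoff <- 0.9   # assumed threshold, tune it for your problem
high   <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  feature_1   = rownames(cor_mat)[high[, "row"]],
  feature_2   = colnames(cor_mat)[high[, "col"]],
  correlation = cor_mat[high]
)
print(high_pairs)
```

One member of each reported pair is a candidate for removal before model building, since the two carry nearly the same information.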


5.4.2 Wrapper Methods 


Wrapper methods use a search algorithm to search the space of possible feature 
subsets and evaluate each subset by running a model on the subset. Wrappers can be 
computationally expensive and have a risk of overfitting to the model. 


Criterion: Measure feature subset "usefulness" 
Search: Search the space of all feature subsets and select the set with the highest score 
Assessment: Cross validation 
Statistical Approaches 
1. Recursive feature elimination 
2. Sequential feature selection algorithms 
1. Sequential Forward Selection 
2. Sequential Backward Selection 
3. Plus-l Minus-r Selection 
4. Bidirectional Search 
5. Sequential Floating Selection 
3. Genetic algorithm 
Results 
1. Give the most useful features for model building 
2. Can cause overfitting 


We will be discussing sequential methods for illustration purposes. The most 
popular sequential methods are forward and backward selection. A similar variation of 
both combined is called a stepwise method. 

Steps in a forward variable selection algorithm are as follows:


1. Choose a model with only one variable, which gives the 
maximum value in your evaluation function. 


2. Add the next variable that improves the evaluation function 
by a maximum value. 


3. Keep repeating Step 2 until there is no more improvement by 
adding a new variable. 


As you can see, this method is computationally intensive and iterative. It's important
to start with a set of variables carefully chosen for the problem. Using all the features
available might not be cost effective. Filter methods can help shorten your list of variables
to a manageable set for wrapper methods.
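The three steps listed above can be sketched as a plain loop. This is a minimal illustration on the built-in mtcars data and is not the procedure applied to the chapter's dataset; the candidate list is arbitrary, and AIC is used as the evaluation function, so "improvement" here means a lower value.

```r
# Toy data: predict transmission type (am) in mtcars from a few candidates
candidates <- c("wt", "hp", "qsec", "disp")
selected   <- character(0)
current    <- glm(am ~ 1, data = mtcars, family = binomial(link = "logit"))

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # AIC of the current model extended by each remaining variable
  trial_aic <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "am")
    AIC(suppressWarnings(glm(f, data = mtcars, family = binomial(link = "logit"))))
  })

  # Stop when no candidate improves (lowers) the AIC
  if (min(trial_aic) >= AIC(current)) break

  best     <- names(which.min(trial_aic))
  selected <- c(selected, best)
  current  <- suppressWarnings(
    glm(reformulate(selected, response = "am"), data = mtcars,
        family = binomial(link = "logit")))
}

print(selected)
print(AIC(current))
```

Each pass refits one model per remaining candidate, which is why the cost grows quickly with the number of features.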

To set up the illustrative example, let’s take a subset of 10 features from the total set 
of features. Let's have the top five continuous variables from our filter method output and 
randomly choose five from the categorical variables. 


#Pull 5 variables we had from highest coefficient of variation (from filter
#method) (f338, f422, f724, f636 and f775)

predictor_set <- c("f338", "f422", "f724", "f636", "f775")

#Randomly pull 5 variables from the categorical variable set (the reader can apply
#a filter method to the categorical variables and choose these 5 variables
#systematically as well)

set.seed(101)
ind <- sample(1:length(categorical), 5, replace = FALSE)
p <- 1
for (i in ind) {
  predictor_set[5 + p] <- categorical[i]
  p <- p + 1
}

#Print the set of 10 variables we will be working with
print(predictor_set)
[1] "f338" "f422" "f724" "f636" "f775" "f222" "f33"  "f309" "f303" "f113"

#Replace f33 with f93, as f33 does not have levels
predictor_set[7] <- "f93"

#Print final list of variables
print(predictor_set)
[1] "f338" "f422" "f724" "f636" "f775" "f222" "f93"  "f309" "f303" "f113"




We are preparing to predict the probability of someone defaulting in the next one-year
time period. Our objective is to select the model based on the following characteristics:

• A smaller number of predictors is preferable
• Penalize a model having a lot of predictors
• Penalize a model for a bad fit


To measure these effects, we will use the Akaike Information Criterion (AIC)
as the evaluation metric. AIC is founded on information theory; it measures
the quality of a model relative to other models. When comparing models, it
deals with the tradeoff between the goodness of fit of the model and the complexity
of the model. Complexity is represented by the number of variables in the
model, where more variables mean greater complexity.

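In statistics, AIC is defined as:

$$ \mathrm{AIC} = 2k - 2\ln(L) = 2k + \text{Deviance} $$

where k is the number of parameters (or features) and L is the maximized value of the
model's likelihood function. A lower AIC indicates a better model.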


Note AIC is a relative measure; hence, it does not tell you anything about the quality of 
the model in the absolute sense. 
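The identity AIC = 2k + Deviance can be checked numerically for a logistic model with a 0/1 response. The sketch below uses the built-in mtcars data and is only an illustration; note the assumption that k counts every estimated coefficient, including the intercept, which is how R's `AIC()` counts parameters.

```r
# Toy logistic model on the built-in mtcars data
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

k           <- length(coef(fit))        # estimated parameters, including the intercept
aic_by_hand <- 2 * k + deviance(fit)    # 2k + Deviance

print(c(by_hand = aic_by_hand, built_in = AIC(fit)))
```

The two numbers agree because, for a binary response, the deviance equals minus twice the maximized log-likelihood.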


To illustrate the feature selection by forward selection, we need to first develop two 
models, one with all features and one with no features: 


• Full model: A model with all the variables included in it. This
  model provides an upper limit on the complexity of the model.

• Null model: A model with no variables in it, just an intercept term.
  This model provides a lower limit on the complexity of the model.


Once we have these two models, we can start the feature selection based on the AIC 
measure. These models are important for AIC to use as a measure of model fit, as AIC will 
be measured relative to these extreme cases in the model. Let's first create a full model 
with all the predictors and see its summary (the output is truncated): 


# Create a small modeling dataset with only predictors and dependent variable
library(data.table)

data_model <- data[, .(id, f338, f422, f724, f636, f775, f222, f93, f309, f303, f113, default), ]

#Make sure to remove the missing cases to resolve errors regarding null values
data_model <- na.omit(data_model)

#Full model uses all the 10 variables
full_model <- glm(default ~ f338 + f422 + f724 + f636 + f775 + f222 + f93 + f309 +
                  f303 + f113, data = data_model, family = binomial(link = "logit"))

#Summary of the full model
summary(full_model)

Call:
glm(formula = default ~ f338 + f422 + f724 + f636 + f775 + f222 +
    f93 + f309 + f303 + f113, family = binomial(link = "logit"),
    data = data_model)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9844  -0.4803  -0.4380  -0.4001   2.7606

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.423e+00  3.146e-02 -77.023  < 2e-16 ***
f338        -1.379e-25  2.876e-25  -0.480 0.631429
f422         1.369e-01  1.387e-02   9.876  < 2e-16 ***
f724         3.197e+00  1.485e+00   2.152 0.031405 *
f636        -9.976e-07  1.851e-06  -0.539 0.589891
f775         5.965e-02  1.287e-02   4.636 3.55e-06 ***
...Output truncated...
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 58874  on 90287  degrees of freedom
Residual deviance: 58189  on 90214  degrees of freedom
AIC: 58337

Number of Fisher Scoring iterations: 12


This output shows the summary of the full model built using all 10 variables. Now, let's
similarly create the null model:


#Null model uses no variables
null_model <- glm(default ~ 1, data = data_model, family = binomial(link = "logit"))

#Summary of the null model
summary(null_model)

Call:
glm(formula = default ~ 1, family = binomial(link = "logit"),
    data = data_model)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.4601  -0.4601  -0.4601  -0.4601   2.1439

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.19241    0.01107    -198   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 58874  on 90287  degrees of freedom
Residual deviance: 58874  on 90287  degrees of freedom
AIC: 58876

Number of Fisher Scoring iterations: 4


At this stage, we have seen the extreme model performance, having all the variables 
in the model and a model without any variables (basically the historical average of 
dependent variable). With these extreme models, we will perform forward selection with 
the null model and start adding variables to it. 
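As a worked check, the AIC values reported in the two summaries follow from AIC = 2k + Deviance. Here k is taken to be the number of estimated coefficients including the intercept; for the full model it is inferred from the drop in residual degrees of freedom, which is an assumption about how the categorical levels were expanded rather than something the truncated output shows directly.

```latex
\begin{align*}
k_{\text{null}} &= 1, & \mathrm{AIC}_{\text{null}} &= 2(1) + 58874 = 58876 \\
k_{\text{full}} &= (90287 - 90214) + 1 = 74, & \mathrm{AIC}_{\text{full}} &= 2(74) + 58189 = 58337
\end{align*}
```

The full model lowers the deviance by 685 at the cost of 73 extra parameters, so its AIC is lower than that of the null model.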

Forward selection will be done in iterations over the variable subset. Observe that 
the base model for the first iteration is the null model with AIC of 58876. Below that is the 
list of variables to choose from to add to the model. 


#Summary of forward selection method
forwards <- step(null_model, scope = list(lower = formula(null_model),
                 upper = formula(full_model)), direction = "forward")
Start:  AIC=58876.26
default ~ 1

       Df Deviance   AIC
+ f222  7    58522 58538
+ f422  1    58743 58747
+ f113  7    58769 58785
+ f303 24    58780 58830
+ f775  1    58841 58845
+ f93   7    58837 58853
+ f309 23    58806 58854
+ f724  1    58870 58874
<none>       58874 58876
+ f636  1    58873 58877
+ f338  1    58874 58878

Iteration 1: The model added f222 to the model.

Step:  AIC=58538.39
default ~ f222

       Df Deviance   AIC
+ f422  1    58405 58423
+ f113  7    58461 58491
+ f303 24    58434 58498
+ f775  1    58495 58513
+ f93   7    58486 58516
+ f309 23    58462 58524
+ f724  1    58518 58536
<none>       58522 58538
+ f636  1    58522 58540
+ f338  1    58522 58540

Iteration 2: The model added f422 to the model.

Step:  AIC=58422.87
default ~ f222 + f422

       Df Deviance   AIC
+ f113  7    58346 58378
+ f303 24    58323 58389
+ f93   7    58370 58402
+ f775  1    58383 58403
+ f309 23    58353 58417
+ f724  1    58401 58421
<none>       58405 58423
+ f636  1    58404 58424
+ f338  1    58404 58424

Iteration 3: The model added f113 to the model.

Step:  AIC=58377.8
default ~ f222 + f422 + f113

       Df Deviance   AIC
+ f303 24    58265 58345
+ f775  1    58325 58359
+ f309 23    58295 58373
+ f724  1    58342 58376
<none>       58346 58378
+ f636  1    58345 58379
+ f338  1    58345 58379
+ f93   7    58338 58384

Iteration 4: The model added f303 to the model.

Step:  AIC=58345.04
default ~ f222 + f422 + f113 + f303

       Df Deviance   AIC
+ f775  1    58245 58327
+ f724  1    58261 58343
<none>       58265 58345
+ f636  1    58264 58346
+ f338  1    58265 58347
+ f309 23    58225 58351
+ f93   7    58257 58351

Iteration 5: The model added f775 to the model.

Step:  AIC=58326.96
default ~ f222 + f422 + f113 + f303 + f775

       Df Deviance   AIC
+ f724  1    58241 58325
<none>       58245 58327
+ f636  1    58244 58328
+ f338  1    58244 58328
+ f309 23    58202 58330
+ f93   7    58237 58333

Iteration 6: The model added f724 to the model.

Step:  AIC=58325.08
default ~ f222 + f422 + f113 + f303 + f775 + f724

       Df Deviance   AIC
<none>       58241 58325
+ f636  1    58240 58326
+ f338  1    58240 58326
+ f309 23    58199 58329
+ f93   7    58233 58331


In the last iteration, i.e., iteration six, you can see that the model has reached its
final set of variables. The best remaining option is <none>, which means we are better off
not adding any more variables to the model. Now let's see how our final forward selection
model looks:

#Summary of final model with forward selection process
formula(forwards)
default ~ f222 + f422 + f113 + f303 + f775 + f724

The forward selection method says that the best model by the AIC criterion can be
created with these six features: f222, f422, f113, f303, f775, and f724. Other feature
selection methods, like backward selection, stepwise selection, etc., can be done in a
similar manner. In the next section, we will be introducing embedded methods that are
computationally better than wrapper methods.
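For reference, backward and stepwise selection use the same `step()` function with a different `direction` argument. The sketch below runs on simulated data, not on the chapter's dataset; the variable names, effect sizes, and seed are assumptions, with `x3` built in as a pure noise variable.

```r
set.seed(42)
# Simulated toy data: y depends on x1 and x2, while x3 is pure noise
n   <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

toy_null <- glm(y ~ 1, data = toy, family = binomial(link = "logit"))
toy_full <- glm(y ~ x1 + x2 + x3, data = toy, family = binomial(link = "logit"))

# Backward selection: start from the full model and drop variables
backward <- step(toy_full, direction = "backward", trace = 0)

# Stepwise selection: start from the null model, allow adding and dropping
stepwise <- step(toy_null,
                 scope = list(lower = formula(toy_null), upper = formula(toy_full)),
                 direction = "both", trace = 0)

print(formula(backward))
print(formula(stepwise))
```

Setting `trace = 0` only suppresses the per-iteration tables; leave it at the default to see output like the forward selection log shown earlier.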


5.4.3 Embedded Methods 


Embedded methods are similar to wrapper methods because they also optimize an
objective function, usually a model performance evaluation function. The difference
from the wrapper method is that an intrinsic model building metric is used during the
learning of the model. Essentially, this is a search problem but a guided search, and
hence is computationally less expensive.


Criterion: Measure feature subset "usefulness" 
Search: Search the space of all feature subsets guided by the learning process 
Assessment: Cross validation 
Statistical Approaches 
1. L1(LASSO) regularization 
2. Decision tree 
3. Forward selection with Gram-Schmidt orthogonalization
4. Gradient descent methods 
Results 
1. Similar to wrapper but with guided search 
2. Less computationally expensive 
3. Less prone to overfitting 


For this method, we will be showing a regularization technique. In machine learning 
space, regularization is a process of introducing additional information to prevent 
overfitting while searching through the variable subset space. In this section, we will show 
an illustration of L1 regularization for variable selection. 

L1 regularization for variable selection is also called LASSO (Least Absolute
Shrinkage and Selection Operator). This method was introduced by Robert Tibshirani
in his famous 1996 paper titled “Regression Shrinkage and Selection via the Lasso,”
published in the Journal of the Royal Statistical Society.

In L1 or LASSO regression, we add a penalty term against the complexity to reduce 
the degree of overfitting or the variance of the model by adding additional bias. So the 
objective function to minimize looks like this: 


regularization cost = cost + regularization penalty 

In LASSO regularization, the general form of the objective function is

$$ \frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i, \alpha, \beta) $$

The lasso regularized version of the estimator will be the solution to:

$$ \min_{\alpha, \beta} \frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i, \alpha, \beta) \quad \text{subject to } \|\beta\|_1 \le t $$

where only β is penalized while α is free to take any allowed value. Adding the
regularization penalty means that our objective function now minimizes the
regularization cost.
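To build intuition for why an L1 penalty performs variable selection, consider the special case of an orthonormal design. There, a standard result (not derived in this chapter) is that the lasso solution is the soft-thresholded least squares estimate, for a suitably scaled penalty. The sketch below applies that rule to hypothetical coefficients; the values and the penalty of 0.5 are made up for illustration.

```r
# Soft-thresholding: shrink every coefficient toward zero by lambda,
# and set it exactly to zero when its magnitude is below lambda
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Hypothetical unpenalized (least squares) coefficients
ols_coefs <- c(x1 = 2.5, x2 = -0.3, x3 = 0.8, x4 = -1.7, x5 = 0.1)

lasso_coefs <- soft_threshold(ols_coefs, lambda = 0.5)
print(lasso_coefs)

# Features that survive the penalty
print(names(lasso_coefs)[lasso_coefs != 0])
```

Small coefficients are removed entirely while large ones are only shrunk, which is the selection behavior that a squared (L2) penalty does not provide.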

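The objective function for the penalized logistic regression uses the negative
binomial log-likelihood, and is as follows:

$$ \min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} -\left[ \frac{1}{N} \sum_{i=1}^{N} y_i (\beta_0 + x_i^T \beta) - \log\left(1 + e^{\beta_0 + x_i^T \beta}\right) \right] + \lambda \left[ (1 - \alpha) \|\beta\|_2^2 / 2 + \alpha \|\beta\|_1 \right] $$

Here λ controls the overall strength of the penalty and α is the elastic-net mixing
parameter (α = 1 gives the lasso penalty); this α is not the free parameter α of the
general form above.

Logistic regression is often plagued with degeneracy when p > N and exhibits
wild behavior even when N is close to p; the elastic-net penalty alleviates these issues and
regularizes and selects variables as well. Source:
https://web.stanford.edu/~hastie/glmnet/glmnet_beta.html.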
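The penalty term of the glmnet objective can be written out directly. The following sketch evaluates it in base R for a hypothetical coefficient vector; the function name `elastic_net_penalty` and the numbers are assumptions for illustration and are not part of the glmnet package.

```r
# Elastic-net penalty: lambda * [ (1 - alpha) * ||beta||_2^2 / 2 + alpha * ||beta||_1 ]
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) * sum(beta^2) / 2 + alpha * sum(abs(beta)))
}

beta <- c(0.5, -1.0, 0.0, 2.0)   # hypothetical coefficients (intercept excluded)

lasso_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # pure L1
ridge_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # pure L2
mixed_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # a blend of both

print(c(lasso = lasso_pen, ridge = ridge_pen, mixed = mixed_pen))
```

`glmnet()` uses `alpha = 1` by default, so a call that does not set `alpha` fits the lasso.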

We will run this example on a set of 10 continuous variables in the dataset. 


#Create data frame with dependent and independent variables (Remove NA)
data_model <- na.omit(data)
y <- as.matrix(data_model$default)
x <- as.matrix(subset(data_model, select = continuous[250:260]))

#We will be using the glmnet package
library("glmnet")

#Fit a model with dependent variable of binomial family
fit <- glmnet(x, y, family = "binomial")

#Summary of fit model
summary(fit)
           Length Class     Mode
a0          44    -none-    numeric
beta       440    dgCMatrix S4
df          44    -none-    numeric
dim          2    -none-    numeric
lambda      44    -none-    numeric
dev.ratio   44    -none-    numeric
nulldev      1    -none-    numeric
npasses      1    -none-    numeric
jerr         1    -none-    numeric
offset       1    -none-    logical
classnames   2    -none-    character
call         4    -none-    call
nobs         1    -none-    numeric


Figure 5-3 plots the coefficients of these 10 variables against the fraction of
deviance explained.

#Plot the output of glmnet fit model
plot(fit, xvar = "dev", label = TRUE)

[Figure: coefficient paths; x-axis: Fraction Deviance Explained, y-axis: Coefficients]

Figure 5-3. Coefficient and fraction of deviance explained by each feature/variable


In the plot with 10 variables shown in Figure 5-3, you can see that the coefficients of all
the variables except #7 and #5 are 0. As the next step, we will cross-validate our fit.
For logistic regression, we will use cv.glmnet, which has similar arguments and usage to
the Gaussian case. For instance, let's use misclassification error as the criterion for
10-fold cross-validation.

#Fit a cross validated binomial model
fit_logistic <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

#Summary of fitted cross validated model
summary(fit_logistic)
           Length Class  Mode
lambda     43     -none- numeric
cvm        43     -none- numeric
cvsd       43     -none- numeric
cvup       43     -none- numeric
cvlo       43     -none- numeric
nzero      43     -none- numeric
name        1     -none- character
glmnet.fit 13     lognet list
lambda.min  1     -none- numeric
lambda.1se  1     -none- numeric


The plot in Figure 5-4 shows how the misclassification rate changes over the
set of features brought into the model. The plot shows that the model is pretty bad, as the
variables we provided perform badly on the data.

#Plot the results
plot(fit_logistic)

Figure 5-4. Misclassification error and log of penalization factor (lambda)

For a good model, Figure 5-4 will show an upward trend in the red dots. This is when
you know what variability you are measuring in your dataset.

We can now pull the regularization factor from the glmnet() fit model. We also pull out
the coefficients and names of the selected variables.


#Print the minimum lambda - regularization factor
print(fit_logistic$lambda.min)
[1] 0.003140939
print(fit_logistic$lambda.1se)
[1] 0.03214848

#Against the lambda minimum value we can get the coefficients
param <- coef(fit_logistic, s = "lambda.min")
param <- as.data.frame(as.matrix(param))
param$feature <- rownames(param)

#The list of variables suggested by the embedded method
#(this filter keeps positive coefficients; use != 0 to keep every non-zero coefficient)
param_embeded <- param[param$`1` > 0, ]

print(param_embeded)
                1 feature
f279 8.990477e-03    f279
f298 2.275977e-02    f298
f322 1.856906e-01    f322
f377 1.654554e-04    f377
f452 1.326603e-04    f452
f453 1.137532e-05    f453
f471 1.548517e+00    f471
f489 1.741923e-02    f489


The final features suggested by the LASSO method are f279, f298, f322, f377,
f452, f453, f471, and f489. Feature selection is a very statistically intense topic. You are
encouraged to read more about the methods and make sure your chosen methodology
fits the business problem you are trying to solve. In most real scenarios, data
scientists have to design a mixture of techniques to get the desired set of variables for
machine learning.


5.5 Dimensionality Reduction 


In recent years, there has been an explosion in the amount as well as the type of data
available at the data scientist's disposal. Traditional machine learning algorithms partly
break down because of the volume of data and mostly because of the number of variables
associated with each observation. The dimension of the data is the number of variables
we have for each observation in our data.

Higher dimensions mean both opportunity and challenge for machine learning
algorithms. Higher dimensions can allow you to capture events that can't be observed
at low dimensions and, at the same time, they make it harder for the machine learning
problem to converge. In this context, Richard E. Bellman coined the term
Curse of Dimensionality, which refers to various phenomena that arise when analyzing
and organizing data in high-dimensional spaces (often with hundreds or thousands of
dimensions) that do not occur in low-dimensional settings such as the three-dimensional
physical space of everyday experiences.

In machine learning problems, the addition of each feature into the dataset 
exponentially increases the requirement of data points to train the model. The learning 
algorithm needs an enormous amount of data to search the right model in the higher 
dimensional space. With a fixed number of training samples, the predictive power 
reduces as the dimensionality increases, and this is known as the Hughes phenomenon 
(named after Gordon F. Hughes). 
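A back-of-the-envelope calculation makes the exponential growth concrete. Suppose each feature is cut into 10 bins, so that the feature space becomes a grid of cells; both the bin count and the fixed sample size of 1,000 below are assumptions chosen only for illustration.

```r
bins_per_feature <- 10     # assumed grid resolution per feature
n_samples        <- 1000   # assumed fixed training-set size

dims  <- 1:6
cells <- bins_per_feature^dims   # number of grid cells grows as 10^d

# Even if every sample fell in a different cell, this is the largest
# fraction of cells that could contain at least one observation
max_coverage <- pmin(1, n_samples / cells)

print(data.frame(dimension = dims, cells = cells,
                 max_fraction_of_cells_occupied = max_coverage))
```

With six features, at most one cell in a thousand can contain any training point, so most of the space the model must generalize over is empty.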

Dimensionality reduction is a process of deriving a set of degrees of freedom that
can be used to reproduce most of the variability of a dataset. Essentially, you are creating
new orthogonal features from raw data that explain a large part of the
variance in the actual features.

In mathematical terms, the problem we investigate can be stated as follows: given
the p-dimensional random variable $x = (x_1, \ldots, x_p)^T$, find a lower dimensional
representation of it, $s = (s_1, \ldots, s_k)^T$ with $k < p$, that captures the content in
the original data, according to some criterion.

Dimensionality reduction is a process of feature extraction rather than a feature
selection process. Feature extraction is a process of transforming the data in the
high-dimensional space to a space of fewer dimensions. The data transformation may be linear,
as in Principal Component Analysis (PCA), but many nonlinear dimensionality reduction
techniques also exist. For multidimensional data, tensor representation can be used in
dimensionality reduction through multilinear subspace learning. For example, by using
PCA, you can reduce a set of variables into a smaller set of variables (principal components)
to model with, e.g., rather than using all 100 features in raw form, you can use the top 10
PCA factors to build a model with performance similar to that of the full model.

Within the scope of this chapter, we will discuss the most popular technique, Principal
Component Analysis (PCA). PCA is based on the covariance matrix; it is a second-order
method. A covariance matrix is a matrix whose element in the i, j position is the
covariance between the ith and jth elements of a random vector. The covariance matrix
plays a key role in financial economics, especially in portfolio theory and its mutual fund
separation theorem and in the capital asset pricing model. PCA creates a linear mapping
of the data from the high-dimensional space to a low-dimensional space such that the
variance of the data in the low-dimensional space is maximized. The method is also
known by other names, e.g., the Hotelling transformation, and it is closely related to the
Singular Value Decomposition (SVD).
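The link between PCA and the covariance matrix can be verified directly: the variances of the principal components are the eigenvalues of the covariance matrix of the data. The sketch below checks this on a few scaled columns of the built-in mtcars data; the column choice is arbitrary and serves only as an illustration.

```r
# Toy data: a few numeric columns of mtcars, scaled to mean 0 and sd 1
toy <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Route 1: eigen decomposition of the covariance matrix
eig <- eigen(cov(toy))

# Route 2: prcomp(), which works through a singular value decomposition
pca <- prcomp(toy)

# The component variances match the eigenvalues
print(rbind(eigenvalues = eig$values, prcomp_variances = pca$sdev^2))

# The loadings match the eigenvectors up to the sign of each column
print(max(abs(abs(eig$vectors) - abs(pca$rotation))))
```

The sign of each eigenvector is arbitrary, which is why the comparison of loadings is made on absolute values.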

For illustration of PCA, we will work with 10 randomly chosen continuous variables 
from our data and create the principal components and check their significance in 
explaining the data. 

Here are the steps for principal component analysis:
1. Load the data as a data.frame.
2. Normalize/scale the data.
3. Apply the prcomp() function to get the principal components.

This performs a principal components analysis on the given data matrix and returns
the results as an object of class prcomp.


#Take a subset of 10 features
pca_data <- data[, .(f381, f408, f495, f529, f549, f539, f579, f634, f706, f743)]

pca_data <- na.omit(pca_data)

head(pca_data)
      f381 f408   f495       f529  f549 f539   f579   f634    f706
1: 1598409    5 238.58 1921993.90 501.0  552 462.61  0.261  4.1296
2:  659959    6   5.98  224932.72 110.0   76  93.77 11.219  4.1224
3: 2036578   13  33.61  192046.42 112.0  137 108.60 16.775  9.2215
4:  536256    4 258.23  232373.41 161.0  116 127.84  1.120  3.2036
5: 2264524   26   1.16   52265.58  21.0   29  20.80 17.739 21.0674
6: 5527421   22  38.91  612209.01 375.9  347 317.27 11.522 17.8663
        f743
1:    -21.82
2:    -72.44
3:    -79.48
4:     18.15
5: -10559.05
6:   8674.08

#Normalise the data before applying PCA (mean = 0 and sd = 1)
scaled_pca_data <- scale(pca_data)

head(scaled_pca_data)
           f381       f408       f495       f529       f549       f539
[1,] -0.5692025 -0.6724669  1.7551841  0.4825810  0.9085923  0.9507127
[2,] -0.6549414 -0.6186983 -0.9505976 -0.4712597 -0.5448800 -0.6449880
[3,] -0.5291705 -0.2423176 -0.6291842 -0.4897436 -0.5374454 -0.4404970
[4,] -0.6662432 -0.7262356  1.9837680 -0.4670777 -0.3552967 -0.5108955
[5,] -0.5083448  0.4566750 -1.0066675 -0.5683081 -0.8757215 -0.8025467
[6,] -0.2102394  0.2416004 -0.5675306 -0.2535894  0.4435555  0.2634886
           f579        f634       f706       f743
[1,]  1.0324757 -0.30383519 -0.5885608 -0.1716417
[2,] -0.5546476  0.06876713 -0.5890247 -0.1751343
[3,] -0.4908339  0.25768651 -0.2604470 -0.1756200
[4,] -0.4080440 -0.27462681 -0.6482307 -0.1688839
[5,] -0.8686385  0.29046517  0.5028836 -0.8986722
[6,]  0.4070758  0.07906997  0.2966099  0.4283437

Do the decomposition on the scaled series:

pca_results <- prcomp(scaled_pca_data)

print(pca_results)
Standard deviations:
 [1] 1.96507747 1.63138621 0.98482612 0.96399979 0.92767640 0.61171578
 [7] 0.55618915 0.13051700 0.12485945 0.03347933

Here is the summary of the 10 principal components we get after applying the prcomp()
function.

summary(pca_results)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6
Standard deviation     1.9651 1.6314 0.98483 0.96400 0.92768 0.61172
Proportion of Variance 0.3861 0.2661 0.09699 0.09293 0.08606 0.03742
Cumulative Proportion  0.3861 0.6523 0.74928 0.84221 0.92827 0.96569
                           PC7    PC8     PC9    PC10
Standard deviation     0.55619 0.1305 0.12486 0.03348
Proportion of Variance 0.03093 0.0017 0.00156 0.00011
Cumulative Proportion  0.99663 0.9983 0.99989 1.00000

The plot in Figure 5-5 shows the variance explained by each principal component.
You can see that the first five principal components are able to represent ~90% of the
information stored in the 10 variables.

plot(pca_results)

[Figure: bar plot titled pca_results; y-axis: Variances]

Figure 5-5. Variance explained by principal components


The plot in Figure 5-6 shows the relationship between principal component 1 and principal
component 2. As we know, the decomposition is orthogonal, and we can see the
orthogonality in the plot by looking at the 90 degrees between PC1 and PC2.

#Create the biplot with principal components
biplot(pca_results, col = c("red", "blue"))

Figure 5-6. Orthogonality of principal components 1 and 2


So instead of using 10 variables for machine learning, you can use these top five
principal components to train the model and still preserve 90% of the information.
Advantages of principal component analysis include:

• Reduces the time and storage space required.
• Removes multi-collinearity and improves the performance of the
  machine learning model.
• Makes it easier to visualize the data when reduced to very low
  dimensions such as 2D or 3D.
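The "top five components, about 90%" figure can be reproduced from the standard deviations printed by `prcomp()`. The sketch below re-enters those ten values by hand, since the underlying dataset is not available here; the 0.90 target is the threshold used in the text.

```r
# Standard deviations of the 10 principal components, copied from the
# print(pca_results) output shown in this section
sdev <- c(1.96507747, 1.63138621, 0.98482612, 0.96399979, 0.92767640,
          0.61171578, 0.55618915, 0.13051700, 0.12485945, 0.03347933)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_var  <- cumsum(prop_var)       # cumulative proportion

# Smallest number of components that reaches the 90% target
n_components <- which(cum_var >= 0.90)[1]

print(round(cum_var, 5))
print(n_components)
```

With a fitted prcomp object, the corresponding component scores are stored in its `x` element, so `pca_results$x[, 1:n_components]` would give the reduced inputs for model building.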


5.6 Feature Engineering Checklist 


The feature selection checklist is a great source for decision-making steps in the 
variable selection process. The list is sourced from the “An Introduction to Variable and 
Feature Selection” paper by Isabelle Guyon and Andre Elisseeff. For a more in-depth 
understanding, reference the paper. 

Selection problem Checklist: 


1. Do you have domain knowledge? If yes, construct a better set 
of features. 


2. Are your features commensurate? If no, consider normalizing 
them. 


3. Do you suspect interdependence of features? If yes, expand
your feature set by constructing conjunctive features or
products of features, as much as your computer resources
allow.


4. Do you need to prune the input variables (e.g., for cost, speed,
or data understanding reasons)? If no, construct disjunctive
features or weighted sums of features (e.g., by clustering or
matrix factorization).


5. Do you need to assess features individually (e.g., to
understand their influence on the system or because their
number is so large that you need to do a first filtering)? If yes,
use a variable ranking method; otherwise, do it anyway to get
baseline results.


6. Do you need a predictor? If no, stop.


7. Do you suspect your data is dirty (has a few meaningless
input patterns and/or noisy outputs or wrong class labels)?
If yes, detect the outlier examples using the top-ranking
variables obtained in step 5 as representation; check and/or
discard them.


8. Do you know what to try first? If no, use a linear predictor.
Use a forward selection method with the “probe” method as a
stopping criterion, or use the 0-norm embedded method.
For comparison, following the ranking of step 5, construct a
sequence of predictors of the same nature using increasing
subsets of features. Can you match or improve the
performance with a smaller subset? If yes, try a non-linear
predictor with that subset.


9. Do you have new ideas, time, computational resources, and
enough examples? If yes, compare several feature selection
methods, including your new idea, correlation coefficients,
backward selection, and embedded methods. Use linear and
non-linear predictors. Select the best approach with model
selection.


10. Do you want a stable solution (to improve performance and/
or understanding)? If yes, subsample your data and redo your
analysis for several bootstraps.
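

Three of the checklist questions (normalizing features, ranking variables individually, and
checking stability over bootstraps) can be sketched in a few lines of base R. The example
below is only an illustration under stated assumptions, not the procedure from the paper: it
uses the built-in mtcars data with mpg as the response, absolute Pearson correlation as the
ranking method, and 200 bootstrap resamples. All object names are hypothetical.

```r
# Illustrative sketch: normalize, rank variables, and check ranking stability.
# Assumptions: mtcars data, mpg as response, |Pearson correlation| as the
# ranking criterion, 200 bootstrap resamples.
set.seed(42)

response   <- "mpg"
predictors <- setdiff(names(mtcars), response)

# Normalize: put all features on a common scale (zero mean, unit variance).
# Scaling does not change correlations, but it matters for distance-based
# and penalty-based methods applied later.
x <- scale(mtcars[, predictors])
y <- mtcars[[response]]

# Variable ranking: score each feature on its own against the response.
rank_features <- function(x, y) {
  sort(abs(cor(x, y)[, 1]), decreasing = TRUE)
}
baseline_rank <- rank_features(x, y)

# Stability: redo the ranking on bootstrap resamples and record how often
# each variable lands among the top three.
n_boot      <- 200
top3_counts <- setNames(numeric(length(predictors)), predictors)
for (b in seq_len(n_boot)) {
  idx  <- sample(nrow(x), replace = TRUE)
  top3 <- names(rank_features(x[idx, , drop = FALSE], y[idx]))[1:3]
  top3_counts[top3] <- top3_counts[top3] + 1
}
top3_share <- sort(top3_counts / n_boot, decreasing = TRUE)

print(round(baseline_rank, 2))
print(top3_share)
```

Variables whose top-three share stays close to 1 across resamples are the more stable
choices; variables that drift in and out of the top ranks deserve a closer look.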


5.7 Summary


Feature engineering is an integral part of machine learning model development. The
volume of data can be reduced by applying sampling techniques. Feature selection
helps reduce the width of the data by selecting the most powerful features. We
developed an understanding of three core methods of variable selection: filter, wrapper,
and embedded. Toward the end of this chapter, we showed examples of Principal
Component Analysis and learned how PCA can reduce dimensionality while retaining
most of the information in the data.

The next chapter, Chapter 6, is the core of this book. It will show you how to
bring your business problems to your IT system and then try solving them using
the R tool.

[1] “An Introduction to Variable and Feature Selection,” by Isabelle Guyon and Andre Elisseeff.


CHAPTER 6


Machine Learning Theory
and Practices





The world is quickly adopting Machine Learning (ML). Whether it's driverless
cars, intelligent personal assistants, or machines playing games like Go and
Jeopardy against humans, ML is pervasive. The availability and ease of collecting
data, coupled with high computing power, has made this field even more conducive
for researchers and businesses to explore data-driven solutions for some of the most
challenging problems. This has led to a surge in the number of new
startups and tools leveraging ML to solve problems in sectors such as healthcare, IT, HR,
automobiles, and manufacturing, and the list is ever expanding.

The abstraction layer between complicated machine learning algorithms and
their implementation has reached an all-time high, thanks to the efforts of ML researchers,
ML engineers, and developers. Today, you don't have to understand the statistics behind
the ML algorithms to apply them to a real-world dataset; knowing how to use a tool is
often sufficient (which has its pros and cons), although you still need to explore
and clean the data and put it into an appropriate format. Many large enterprises have
released APIs that provide analytics-as-a-service, with capabilities to build
predictive models using ML. It does not stop there: companies like Google, Facebook,
and IBM have already taken the lead in making some of their systems completely open
source. Just as Android revolutionized the mobile industry, these ML
systems could do the same for the next generation of fully automated machines.

It remains to be seen where the next path-breaking, billion-dollar, disruptive
idea will come from. Though all this might sound like a distant dream, it's fast
approaching. The past two decades gave us Google, Twitter, WhatsApp, Facebook, and
many others in the technology field. All these billion-dollar enterprises have data and
use it to create possibilities that we didn't know about a few years back. Technologies
from computer vision to online location maps have changed the way we work in this
century. Who would have thought that, sitting in one place, you could find the best route
from one place to another, or that a drone could perform a search-and-rescue operation?
These possibilities did not exist a few decades ago, but now they are reality. All of them
are driven by data and what we have been able to learn from that data. The future belongs
to enterprises and individuals who embrace the power of data.

In this chapter, we dive deep into the fascinating and exciting world
of machine learning, where we have tried to maintain a fine balance between theory
and the tool-centric practical aspects of the subject. As this chapter is the crux of the
entire book, we use real industry data to illustrate the algorithms and, at the same
time, show how the concepts you learned in previous chapters connect to this
chapter. This means you will now start to see how the ML process
flow, PEBE, which we proposed in Chapter 1, plays a key role. Chapters 2 to 5
laid the foundation and prerequisites for running an ML algorithm effectively and
efficiently. We learned about properties of data, data types, hidden patterns through
visualization, sampling, and creating the best set of features for applying an ML algorithm.
The chapters after this one are more about how to measure the performance of models,
how to improve them, and which technologies can help you take ML to an actual scalable
environment.

Roughly speaking, statistical learning techniques become machine learning techniques
when they are used for prediction and forecasting. In this chapter, we briefly touch on the
statistical background of each algorithm and then show you how to run it in the R
environment and interpret the results. We have devised the following 3D
approach to help readers get started quickly with ML and learn on the fly with the
right blend of theory and practice:


• 1st-D: The statistical background. We introduce the
core formulation or statistical concept behind each ML concept or
algorithm. Since the statistical concepts make the discussion
intense and fairly complicated for beginners and intermediate
readers, we have designed a much lighter version of these
concepts and expect interested readers to refer to more
detailed literature (we have provided references
wherever possible).


• 2nd-D: Demo in R. We set up the R environment and write R scripts
to work with the datasets provided for a real-world problem. This
approach of getting started quickly with R programming after the
theoretical foundation on the problem and the ML algorithm has
been adopted with two audiences in mind: industry professionals who are
looking for quick prototyping and researchers who want to get
started with practical implementations of ML algorithms. Industry
professionals might tend to identify the problem statement in
their domain of work and apply a brute-force approach of trying all
possible ML algorithms, whereas researchers tend to master the
foundational elements and then proceed to the implementation
side of things. This chapter is suitable for both.


• 3rd-D: Real-world use case. The datasets we have chosen to
explain the ML algorithms are from real scenarios, curated for
the purpose of explaining the concepts. This means our examples
are built on real data, so that we emulate a real-world
scenario, where the model results are not always good. Sometimes
you get good results, sometimes very poor ones. A few algorithms work
well, while others don't work on the same data. This approach will help
readers see all algorithms of the same type on a common footing,
compare them, and make judicious decisions based on the actual
problem at hand. We have avoided examples that
explain the concepts in an ideal situation and create a false impression
about performance. For example, rather than choosing a linear
regression-based example with an R-square (a performance metric) of 90%, we
present a real-world case, where it could be as poor as
30%, and discuss how to improve it. This builds a strong
case for how to approach an ML model-building process in the
real world. However, wherever required, we have also taken up a few
standard datasets from popular repositories for easy
explanation of certain concepts.


Additionally, we encourage you to consider three sources of additional information,
starting from this chapter (and book). These sources are readily available on the
Internet and in printed books.


• Statistical concepts: We encourage you to first read about each
approach/algorithm as it's illustrated in this chapter and, if
interested, learn the concepts in more detail from the references
provided to the original literature.


• R packages: R is evolving really fast with its global network of
contributors. In this chapter, we have tried to cover the latest packages
and functions. We encourage you to follow CRAN and other
reliable resources like vignettes of R packages, lecture notes,
use cases, and research papers to keep up to date with R
implementations and their latest developments.


• Case study: There will be some concepts that are specific to your
problem/industry. Try to connect the discussions provided in
the chapter (mostly generic) with your own industry or field of
expertise. For example, when we predict the “choice” of which
product a customer will choose from a basket of products, this
fits a retail setup, but you can think about the same case as
predicting “default/non-default” in banking or predicting
“infected/not-infected” in medical diagnosis, and so on. Keep
reading the latest reports and industry use cases, as they will
provide new ideas for how to use the techniques discussed in the
chapter.


In the rest of the chapter, we discuss machine learning processes, describe the
real-world use cases, and then demonstrate the application of ML to these use cases. We
have very broadly divided the ML algorithms into twelve groups in Section 6.2 and
discuss selected algorithms from each group in this and the coming chapters. Some
of these groups are touched on in previous chapters as well, where we felt it was more
relevant. Normally, other books on the subject would dedicate an entire book to
each such group; however, based on our PEBE framework for the machine learning process
flow, we have consolidated all the ML algorithms into one chapter, providing you with a
much-needed comprehensive guide to ML.


6.1 Machine Learning Types


In the machine learning literature, there are multiple ways to bucket the algorithms
so that they can be studied collectively. The most popular division is based on two factors:


• Learning types: This has to do with what type of response variable
(or label) we have in the training data. In this section, we discuss
supervised, unsupervised, semi-supervised, and reinforcement
learning.


• Subjective grouping: This grouping is driven by “what” the model
is trying to achieve. Each group shares a similar set of algorithmic
approaches and principles. We will show this grouping and a few
popular techniques within each group. These similarities help create
the 12 groups mentioned earlier in the chapter.


There is a lot of overlap in which ML algorithms can be applied to a particular
problem. As a result, for the same problem, there could be many different ML models
possible. Chapters 7 and 8 discuss a few ways to choose the best among them and to
combine a few to create a new ensemble. So, coming up with the best ML model is an art
that requires a lot of patience and trial and error. Figure 6-1 provides a brief overview of
these learning types with sample use cases.


[Figure: the four machine learning types (supervised, unsupervised, semi-supervised, and
reinforcement learning), the kind of target variable each works with (continuous,
categorical, or not available), and sample use cases such as medical imaging, customer
segmentation, market basket analysis, and text classification]


Figure 6-1. Machine learning types


6.1.1 Supervised Learning


A class of ML algorithms where the data contains a response variable (also called a label),
or where it is possible to generate one, is termed supervised learning. In other words, it
works on a dataset where each instance has a correctly identified response. The response
variable can be either continuous or categorical. The algorithm learns the response variable
from the provided set of predictor variables. For example, if the dataset belongs to a set
of patients, each instance will have a response variable identifying whether a patient has
cancer or not (categorical). Or, in a dataset of house prices in a given state or country, the
response variable could be the price of the house (continuous).

Keep in mind that defining the problem clearly is important for starting in the right
direction (what this involves will become clearer when we discuss the real-world use cases
later in the chapter). The problem is a classification task if the response variable
is categorical and a regression task if it's continuous. Though this rule holds in
most cases, there are certain problems that are a mix of both classification and regression.
Some applications of supervised learning are speech recognition, credit scoring, medical
imaging, and search engines.
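
To make the regression/classification distinction concrete, here is a minimal base R sketch.
It is an illustration only and is not taken from the datasets used later in this chapter: it
assumes the built-in mtcars data, treats mpg as a continuous response and am (0 = automatic,
1 = manual) as a categorical response, and the chosen predictors are arbitrary.

```r
# Regression task: the response (mpg) is continuous.
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = data.frame(wt = 3, hp = 110))

# Classification task: the response (am) is categorical
# (0 = automatic, 1 = manual), so we fit a logistic regression.
clf_fit    <- glm(am ~ wt, data = mtcars, family = binomial)
pred_prob  <- predict(clf_fit, type = "response")  # P(am = 1) per car
pred_class <- as.integer(pred_prob > 0.5)          # assumed 0.5 cutoff

# In-sample accuracy; Chapter 7 covers proper performance measurement.
accuracy <- mean(pred_class == mtcars$am)

print(reg_pred)
print(table(predicted = pred_class, actual = mtcars$am))
print(accuracy)
```

The same predictors could feed either model; it is the type of the response variable that
decides which kind of task you are solving.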


6.1.2 Unsupervised Learning 


On the other hand, when the labels are not available, the class of ML algorithms is called
unsupervised. The learning happens based on some measure of similarity or distance
between the rows in the dataset. The most commonly used technique in unsupervised
learning is clustering. Other methods, like Association Rule Mining (ARM), are based on
the frequency of an event, like a purchase in a market basket, server crashes in log mining,
and so on. (A lot of literature will argue that ARM is a data mining technique rather than
machine learning. Refer to Chapter 1, where we presented a detailed argument on the
differences between statistics, ML, and data mining.) Some applications of unsupervised
learning are customer segmentation in marketing, social network analysis, image
segmentation, climatology, and many more.
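
As a small illustration of learning from similarity alone, the sketch below clusters the
built-in iris measurements with k-means without showing the algorithm the species labels.
The choice of three clusters, the scaling step, and the use of iris are assumptions made for
this example; the labels are used only afterward, to inspect the result.

```r
# Unsupervised learning sketch: cluster rows by distance, ignoring labels.
set.seed(123)

# Use only the four numeric measurements, scaled to a common range.
features <- scale(iris[, 1:4])

# k-means with an assumed k = 3; nstart restarts guard against a poor
# random initialization.
km <- kmeans(features, centers = 3, nstart = 25)

# Compare the discovered clusters with the species labels that the
# algorithm never saw.
print(table(cluster = km$cluster, species = iris$Species))
```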


6.1.3 Semi-Supervised Learning 


In the previous two types, labels are either absent for all observations in the dataset or
present for all of them. Semi-supervised learning falls between these
two. In many practical situations, the cost of labeling is quite high, since it requires skilled
human experts. So, when labels are absent for the majority of observations
but present for a few, semi-supervised algorithms are the best candidates for model
building. These methods exploit the idea that even though the group memberships of the
unlabeled data are unknown, this data carries important information about the group
parameters. The most extensive literature on this topic is provided in the book
Semi-Supervised Learning (MIT Press, Cambridge, MA) by Chapelle, O. et al. [1]. Also, there
are packages in R, like upclass, that help build a semi-supervised learning model.


6.1.4 Reinforcement Learning 


Both supervised and unsupervised learning algorithms need clean and accurate data
to produce the best results. Also, the data needs to be comprehensive in order for the
model to work on unseen instances. For example, if the data for predicting cancer from
patients’ medical history did not cover a particular type of cancer, the algorithm
would produce many false alarms when deployed in real time. So, in cases where
the data for learning is not currently available, or where it updates rapidly with time,
reinforcement learning is an ideal choice. Much of the innovation in robotics and driverless
cars comes from this class of ML algorithms. A reinforcement learning algorithm (called the
agent) continuously learns from the environment in an iterative fashion. In the process,
the agent learns from its experiences of the environment until it explores the full range of
possible states. Some applications of RL algorithms are computer-played board games
(Chess, Go), robotic hands, and self-driving cars.

A detailed discussion of semi-supervised and RL algorithms is beyond the scope of 
this book; however, we will reference them wherever necessary. 


6.2 Groups of Machine Learning Algorithms 


The ML algorithms are grouped into twelve modules based on the similarity of
approach and algorithm output. This will help you create use cases within the same
module for a more diverse set of problems.

Another benefit of organizing algorithms in this manner is the ease of working with R
libraries, which are designed to contain all relevant/similar functions in a single library.
This helps users explore all options/diagnostics for a problem using a single library.
The list is ever-expanding, with new use cases emerging from academia and industry.
We will mention which of these algorithms are covered in this book and let you explore
more from other sources.


• Regression-based methods. Regression-based methods are
the most popular and widely used in academia and research.
They are easy to explain and easy to put into a live production
environment. In this class of methods, the relationship between
a dependent variable and a set of independent variables is estimated
by a probabilistic method or by error function minimization.
We cover linear regression, polynomial
regression, and logistic regression in this chapter and
touch on them in other chapters as well.


[Figure: regression algorithms: Ordinary Least Squares Regression (OLSR), Linear Regression,
Logistic Regression, Stepwise Regression, Polynomial Regression, and Locally Estimated
Scatterplot Smoothing (LOESS)]


Figure 6-2. Regression algorithms


• Distance-based algorithms. Distance-based or event-based
algorithms are used for learning representations of data and
creating a metric to identify whether an object belongs to the
class of interest or not. They are sometimes called memory-based
learning, as they learn from a set of instances/events captured in
the data. We will use K-Nearest Neighbor and Learning Vector
Quantization in creating ensembles in Chapter 8.


[Figure: distance-based algorithms: k-Nearest Neighbor (kNN), Learning Vector Quantization
(LVQ), and Self-Organizing Map (SOM)]


Figure 6-3. Distance-based algorithms


• Regularization methods. Regularization methods are essentially
an extension of regression methods. Regularization algorithms
introduce a penalization term to the loss function (as discussed
in Chapter 5) to balance the complexity of the model against the
improvement in results. They are very powerful and useful
techniques when dealing with data with a high number
of features and large data volume. We introduced L1
regularization in Chapter 5 as an embedded method of variable
subset selection.


[Figure: regularization algorithms: Ridge Regression, Least Absolute Shrinkage and Selection
Operator (LASSO), and Least-Angle Regression (LARS)]


Figure 6-4. Regularization algorithms


• Tree-based algorithms. These algorithms are based on sequential
conditional rules applied to the actual data. The rules are
generally applied serially, and a classification decision is made
when all the conditions are met. These methods are very popular
in decision-making engines and classification problems. They are
fast and can be run in a distributed fashion. We discuss algorithms like CART,
Iterative Dichotomiser, CHAID, and C5.0 in this chapter and use
them to train our ensemble model in Chapter 8.


[Figure: decision tree algorithms: Classification and Regression Tree (CART), Iterative
Dichotomiser 3 (ID3), C4.5 and C5.0 (different versions of a powerful approach), Chi-squared
Automatic Interaction Detection (CHAID), Random Forest, and Conditional Decision Trees]


Figure 6-5. Decision tree algorithms


• Bayesian algorithms. These algorithms might not be called
learning algorithms, as they work on Bayes' theorem, based
on prior and posterior distributions. The machine essentially does
not learn from an iterative process but uses inference from the
distributions of variables. These methods are very popular and
easy to explain, and are used mostly in classification and inference
testing. We cover the Naive Bayes model in this chapter and
introduce basic ideas from probability to explain it.


[Figure: Bayesian algorithms: Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes,
Bayesian Belief Network (BBN), and Bayesian Network (BN)]


Figure 6-6. Bayesian algorithms


• Clustering algorithms. These algorithms generally work on the
simple principle of maximizing intracluster similarities
and minimizing intercluster similarities. The measure of
similarity determines how the clusters are formed. These algorithms
are very useful in marketing and demographic studies. Mostly
these are unsupervised algorithms, which group the data for
maximum commonality. We discuss k-means,
expectation-maximization, and hierarchical clustering. We also discuss
distributed clustering.


[Figure: clustering algorithms, including k-Means, k-Medians, Partitioning Around Medoids
(PAM), and Hierarchical Clustering]


Figure 6-7. Clustering algorithms


• Association rule mining. In these algorithms, the relationship
among the variables is observed and quantified
for predictive and exploratory objectives. These
methods have proved to be very useful for building and
mining relationships in large multi-dimensional datasets.
Popular recommendation systems are based on some variation
of association rule mining algorithms. In this chapter, we discuss the Apriori and
Eclat algorithms for association rule mining, and
user-based and item-based collaborative filtering for recommendation
algorithms.


[Figure: association rule mining algorithms: Apriori algorithm, Eclat algorithm, FP-growth
algorithm, and Context Based Rule Mining]


Figure 6-8. Association rule mining


• Artificial Neural Networks (ANN). Inspired by biological
neural networks, these are powerful enough to learn non-linear
relationships and recognize higher-order relationships among
variables. They can implement both supervised and unsupervised
learning processes. There is a stark difference between the
complexity of traditional neural networks and deep learning
neural networks (discussed later in this chapter). We discuss
the Perceptron and back-propagation in this chapter.


[Figure: artificial neural network algorithms, including Radial Basis Function Network (RBFN)]


Figure 6-9. Artificial neural networks


• Deep learning. These algorithms work on complex neural
structures that can abstract higher-level information from a
huge dataset. They are computationally heavy and hard to train.
In simple terms, you can think of them as very large neural nets
with multiple hidden layers. We provide a deep architecture network
and an image recognition (convolutional nets) example in this
chapter.


[Figure: deep learning algorithms: Deep Boltzmann Machine (DBM), Deep Belief Networks (DBN),
Convolutional Neural Network (CNN), and Stacked Auto-Encoders]


Figure 6-10. Deep learning algorithms


• Dimensionality reduction. These are essentially methods
for amplifying the signal in data by various transformations
and supervised learning approaches. These methods are
usually applied prior to modeling. We discussed Principal
Component Analysis (PCA) in Chapter 5.


[Figure: dimensionality reduction algorithms: Principal Component Analysis (PCA), Principal
Component Regression (PCR), Partial Least Squares Regression (PLSR), Multidimensional
Scaling (MDS), Linear Discriminant Analysis (LDA), Mixture Discriminant Analysis (MDA), and
Quadratic Discriminant Analysis (QDA)]


Figure 6-11. Dimensionality reduction algorithms


• Ensemble learning. This is a set of algorithms that is built by
combining results from multiple machine learning algorithms.
These methods have become very popular due to their ability
to provide superior results and the possibility of breaking them into
independent models to train on a distributed network. We discuss
bagging, boosting, stacking, and blending ensembles in Chapter 8.


[Figure: ensemble algorithms, including Stacked Generalization (blending) and Gradient
Boosting Machines (GBM)]


Figure 6-12. Ensemble learning


• Text mining. Also known as text analytics, it is a subfield of
Natural Language Processing that provides algorithms
and approaches for dealing with unstructured textual data,
commonly obtained from call center logs, customer reviews,
and so on. The algorithms in this group can deal with highly
unstructured text data to bring out insights and/or create features
for applying machine learning algorithms. We discuss text
summarization, sentiment analysis, word clouds, and topic
identification.


[Figure: text mining algorithms: Automatic summarization, Named entity recognition (NER),
Optical character recognition (OCR), Part-of-speech tagging, Sentiment analysis, Speech
recognition, and Topic Modeling]


Figure 6-13. Text mining algorithms
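

The regularization entry in the list above mentions a penalization term added to the loss
function. A minimal base R sketch of that idea follows, using the closed-form ridge solution
(X'X + lambda * I)^(-1) X'y. The predictors taken from mtcars, the value lambda = 10, and the
helper name ridge_coef are assumptions for illustration; dedicated R packages would normally
be used in practice.

```r
# Ridge regression sketch: least squares plus an L2 penalty on the
# coefficients. lambda = 0 reproduces ordinary least squares.
predictors <- c("wt", "hp", "disp", "qsec")

# Standardize predictors and center the response so that no intercept
# is needed and the penalty treats all coefficients alike.
x <- scale(as.matrix(mtcars[, predictors]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Hypothetical helper: solves (X'X + lambda * I) beta = X'y.
ridge_coef <- function(x, y, lambda) {
  penalty <- lambda * diag(ncol(x))
  drop(solve(crossprod(x) + penalty, crossprod(x, y)))
}

ols_coef     <- ridge_coef(x, y, lambda = 0)
ridge_coef10 <- ridge_coef(x, y, lambda = 10)

# The penalized coefficients are shrunk toward zero.
print(round(rbind(ols = ols_coef, ridge = ridge_coef10), 3))
```

Increasing lambda trades some fit on the training data for smaller, more stable
coefficients, which is the balance between model complexity and improvement in results
described above.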


The algorithms discussed have multiple implementations in R, Python, and
other statistics packages. Not all of the methods have readily available R packages for
implementation. Some algorithms are not fully supported in the R environment and
have to be used by calling APIs, e.g., text mining and deep neural nets. The research
community is working toward bringing all the latest algorithms into R, either via a package
or via APIs.

Torsten Hothorn maintains an exhaustive list of packages available in R for
implementing machine learning algorithms. (Reference: CRAN Task View: Machine
Learning & Statistical Learning at
https://cran.r-project.org/web/views/MachineLearning.html.)

We recommend you keep an eye on this list and keep following up with the latest 
package releases. In the next section we present a brief taxonomy of all the real-world 
datasets that are going to be used in this chapter and in the coming chapters for demos 
using R. 


6.3 Real-World Datasets 


Throughout this chapter, we are going to use the following set of real-world datasets and 
build many use cases around them in order to demonstrate the various ML algorithms. 
In this section, a brief taxonomy of datasets associated with each use case is presented 
before we start with the demos using R. Apart from these broader datasets, we also
use many smaller datasets wherever they are necessary to explain certain concepts.


6.3.1 House Sale Prices 


The selling price of a house depends on many variables; this dataset presents a 
combination of factors to predict the selling price. Table 6-1 presents the metadata of this 
House Sale Price dataset. 


Table 6-1. House Sale Price Dataset


No. | Variable Description                            | Code/Values                   | Variable Name
1   | Unique identifier of house property             | 1 to 1201                     | HOUSE_ID
2   | Selling price of the house                      | Positive Integer              | HousePrice
3   | Storage space area in sq. ft                    | Positive Integer              | StoreArea
4   | Area of the house basement                      | Positive Integer              | BasementArea
5   | Area of the lawn in sq. ft                      | Positive Integer              | LawnArea
6   | Width of street connected to the house in feet  | Positive Integer              | StreetHouseFront
7   | Location of the property                        | Location Names                | Location
8   | The type of connectivity to house               | Types of Road                 | ConnectivityType
9   | The type of building construction               | Type of building construction | BuidlingType
10  | The year house was built                        | Date (Year)                   | ConstructionYear
11  | The type of estate/society                      | SemiPrivate, Government, etc. | EstateType
12  | The year in which house sale took place         | Date (Year)                   | SellingYear
13  | Rating of house based on quality and location   | 1 - Worst to 10 - Best        | Rating
14  | Indicator for fresh sale or resale              | NewHouse, FirstResale, etc.   | SaleType





6.3.2 Purchase Preference 


This data contains the transaction history of customers who have bought a particular
product. For each customer ID, multiple data points are simulated to capture the
purchase behavior. The data was originally set up for solving multi-class models with four
possible products from the insurance industry. The features are generic enough that
they could be adapted to other industries, like automobile and retail, where you could
have data about car purchases, consumer goods, and so on.


Table 6-2. Purchase Preferences


No. | Variable Description                                           | Code/Values                          | Variable Name
1   | Unique identifier of customer                                  | 1 to 1201                            | CUSTOMER_ID
2   | Choice of product purchased                                    | 1, 2, 3, and 4                       | ProductChoice
3   | Customer member reward points                                  | 1-13                                 | MembershipPoints
4   | The mode of payment by the customer                            | Cash, CreditCard, etc.               | ModeOfPayment
5   | Customer resident city                                         | City Name                            | ResidentCity
6   | Number of months since the first purchase made by the customer | Positive Integer                     | PurchaseTenure
7   | Channel of purchase                                            | Online/Offline                       | Channel
8   | Income of customer                                             | 1 - Lowest to 9 - Highest            | incomeClass
9   | Buying propensity rating of the customer                       | VeryHigh, High, Medium, Low, Unknown | CustomerPropensity
10  | Age of customer as on last purchase                            | Positive Integer                     | CustomerAge
11  | Marital status of customer                                     | 1 - Married, 0 - Single              | MartialStatus
12  | Months since the last purchase                                 | Positive Integer                     | LastPurchaseDuration





6.3.3 Twitter Feeds and Article 


We collected some Twitter feeds to generate results for applying text mining algorithms.
The feeds are taken from national news channel Twitter accounts as of September 30,
2016. The handles used are @TimesNow and @CNN. One article available on the Internet
has been used for summarization. The original article can be found at
http://www.yourarticlelibrary.com/essay/essay-on-india-after-independence/41354/.


6.3.4 Breast Cancer 


We will be using the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI machine 
learning repository. The features in the dataset are computed from a digitized image of a 
fine needle aspirate (FNA) of a breast mass. Each variable, except for the first and last, was 
converted into 11 primitive numerical attributes with values ranging from 0 to 10. They 
describe characteristics of the cell nuclei present in the image. Table 6-3 lists the features 
available. 


Table 6-3. Breast Cancer Wisconsin


No. | Variable Description        | Code/Values                 | Variable Name
1   | Sample code number          | 1-699                       | Id (V1)
2   | Clump Thickness             | 1-10                        | Cl.thickness (V2)
3   | Uniformity of Cell Size     | 1-10                        | Cell.size (V3)
4   | Uniformity of Cell Shape    | 1-10                        | Cell.shape (V4)
5   | Marginal Adhesion           | 1-10                        | Marg.adhesion (V5)
6   | Single Epithelial Cell Size | 1-10                        | Epith.c.size (V6)
7   | Bare Nuclei                 | 1-10                        | Bare.nuclei (V7)
8   | Bland Chromatin             | 1-10                        | Bl.cromatin (V8)
9   | Normal Nucleoli             | 1-10                        | Normal.nucleoli (V9)
10  | Mitoses                     | 1-10                        | Mitoses (V10)
11  | Class                       | (2 - Benign, 4 - Malignant) | Class (V11)





6.3.5 Market Basket 


We will use real-world data from a small supermarket. Each row of this data contains
a customer transaction with a list of the products (from now on, we will use the term items)
the customer purchased. Since a typical supermarket carries too many items, we have
aggregated them to the category level. For example, “baking needs” covers a number of
different products like dough, baking soda, butter, and so on. For illustration, let’s take
a small subset of the data consisting of five transactions and nine items, as shown in
Table 6-4.


Table 6-4. Market Basket Data


Transaction | Items
T1          | bread and cake, baking needs, biscuits, canned fruit
T2          | bread and cake, baking needs, Jams spreads, canned vegetables
T3          | bread and cake, frozen foods
T4          | frozen foods, laundry needs, deodorants soap
T5          | Jams spreads, laundry needs


6.3.6 Amazon Food Review 


The Amazon Fine Food Reviews dataset consists of 568,454 food reviews that Amazon
users left up to October 2012. A subset of this data is used for the text mining
approaches in this chapter to show text summarization, categorization, and
part-of-speech extraction. Table 6-5 contains the metadata of the Amazon Fine Food
Reviews dataset.




CHAPTER 6 = MACHINE LEARNING THEORY AND PRACTICES 


Table 6-5. Amazon Food Review


Variable  Description                                    Code/Values       Variable Name
1         Id of the reviewer                             1-35173           Id
2         Unique identifier for the product              Alphanumeric      ProductId
3         Unique identifier for the user                 Alphanumeric      UserId
4         ProfileName                                    Alphanumeric      ProfileName
5         Number of users who found the review helpful   Positive Integer  HelpfulnessNumerator
6         Number of users who indicated whether they     Positive Integer  HelpfulnessDenominator
          found the review helpful
7         Rating between 1 and 5                         1-5               Score
8         Timestamp for the review                       Date timestamp    Time
9         Brief summary of the review                    Character         Summary
10        Text of the review                             Character         Text


The rest of the chapter will discuss every machine learning algorithm based on
the grouping discussed earlier and consistently explain every algorithm with our 3D
approach: discussing the statistical background, demonstrating it in R, and applying it
to a real-world use case.


6.4 Regression Analysis 


In previous chapters, we were trying to set the stage for modeling techniques to work 
for our desired objective. This chapter touches on some of the out-of-box techniques in 
statistical learning and machine learning space. At this stage you might want to focus on 
the algorithmic approaches and not worry much about how statistical assumptions play 
a role in machine learning algorithms. For completeness, we discuss in Chapter 8 how 
statistical learning differs from machine learning. 

This section on regression analysis focuses on building a thought process around
how modeling techniques establish and quantify a relationship between response variables
and predictors. We will start by identifying how strong the relationship is and what type
it is, and then see whether it can be modeled under a distributional assumption, such
as normality. We will also address some of the important diagnostic features of popular
techniques and explain their significance in model selection.

The focus of these techniques is to find relationships that are statistically significant
and do not bear any distributional assumptions. The techniques do not establish
causation (best understood through the maxim “a strong association is not a proof
of causation”), but they give the data scientist an indication of how the data series are
related, given some assumptions about the parameters. Establishing causation rests on
prudence and a business understanding of the process.




The concept of causation is important to keep in mind, as most of the time our 
thought process deviates from how relationships quantified by a model have to be 
interpreted. For example, a statistical model will be able to quantify relationships 
between completely irrelevant measures, say electricity generation and beer 
consumption. The linear model will be able to quantify a relationship among them. 

But does beer consumption relate to electricity generation? Or does more electricity 
generation mean more beer consumption? Unless you try very hard, it's difficult to prove. 
Hence, a clear understanding of the process in discussion and domain knowledge is 
important. You have to challenge the assumptions to get the real value out of the data. 
This curse of causation needs to be kept in mind while we discuss correlation and other 
concepts in regression. 

Any regression analysis involves three key sets of variables:


• Dependent or response variables (Y): Output series
• Independent or predictor variables (X): Input series
• Model parameters: Unknown parameters to be estimated by the
regression model


For more than one independent variable and a single dependent variable, these
quantities can be thought of as matrices.

The regression relationship can be shown as a function that maps from the
independent variable space to the dependent variable space. This relationship is the
foundation of prediction/forecasting:


$$Y \approx f(X, \beta)$$


This notation reads like mathematical modeling; statistical modeling
scholars use a slightly different notation for the same relationship:


$$E(Y \mid X) = f(X, \beta)$$


In statistical modeling, regression analysis estimates the conditional expectation of the
dependent variable for known values of the independent variables, which is nothing but the
average value of the dependent variable for given values of the independent variables. Another
important concept to understand before we expand the idea of regression is the distinction
between parametric and non-parametric methods. The discussion in this section is based on
parametric methods, although there exists another set of techniques that are non-parametric.


• Parametric methods assume that the sample data is drawn
from a known probability distribution based on a fixed set of
parameters. For instance, linear regression assumes a normal
distribution, whereas logistic regression assumes a binomial
distribution. This assumption allows the methods to be applied to small
datasets as well.

• Non-parametric methods do not assume any probability
distribution a priori; rather, they construct empirical distributions
from the underlying data. These methods require a large volume
of data for model estimation. There exists a separate branch of
non-parametric regression, which is out of the scope of this book,
e.g., kernel regression, Nonparametric Multiplicative Regression
(NPMR), etc. A good resource to read more on this topic is
“Artificial Intelligence: A Modern Approach” by Stuart Russell and
Peter Norvig. [2]


Further, using a parametric method allows you to easily create confidence intervals
around the estimated parameters; we will use this in our model diagnostic measures.
In this book we will be working with two types of models: linear regression for continuous
data with a normality assumption, and logistic regression with a binomial assumption. Also,
a small primer on the generalized framework will be provided for further reading.
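To make the contrast concrete, here is a minimal sketch on simulated data (the variables and seed are assumptions made for illustration, not the book's datasets). The parametric fit is summarized by two coefficients with confidence intervals, while the non-parametric LOWESS fit returns only smoothed points.

```r
# Simulated data: a linear trend plus normal noise (illustrative only)
set.seed(1)
x <- runif(200, 0, 10)
y <- 5 + 2 * x + rnorm(200, sd = 2)

# Parametric: the whole fit is described by an intercept and a slope
fit_param <- lm(y ~ x)
print(coef(fit_param))
# Confidence intervals follow directly from the distributional assumption
print(confint(fit_param, level = 0.95))

# Non-parametric: lowess() returns smoothed (x, y) points, not coefficients
fit_nonparam <- lowess(x, y)
str(fit_nonparam)
```

The design point is that `confint()` has something to work with only because the parametric model commits to a distribution for the errors.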


6.5 Correlation Analysis 


The object of statistical science is to discover methods of condensing 
information concerning large groups of allied facts into brief and 
compendious expressions suitable for discussion 


—Sir Francis Galton (1822-1911) 


Correlation can be seen as a broader term used to represent the statistical relationship
between two variables. Correlation, in principle, provides a single measure of the relationship
among the variables. There are multiple ways in which a relationship can be quantified,
which is why we have so many types of correlation coefficients in statistics.

For measuring linear relationships, Pearson correlation is the best measure. Pearson 
correlation, also called the Pearson Product-Moment Correlation Coefficient, is sensitive 
to linear relationships. It also exists for non-linear relationships but doesn’t provide any 
useful information in those cases. 
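The point about non-linear relationships can be illustrated with a small sketch on made-up values (not the book's data): a perfectly deterministic quadratic relationship that is symmetric about zero yields a Pearson correlation of approximately zero.

```r
# Made-up values: x is symmetric around zero
x <- seq(-5, 5, by = 0.5)
y_linear    <- 3 * x + 2   # exact linear relationship
y_quadratic <- x^2         # exact, but non-linear, relationship

r_linear    <- cor(x, y_linear)     # Pearson is the default method
r_quadratic <- cor(x, y_quadratic)  # approximately 0 despite full dependence

cat("Linear:", r_linear, " Quadratic:", r_quadratic, "\n")
```

Here y is completely determined by x in both cases, yet only the linear case is picked up by the coefficient.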

Let’s assume two random variables, X and Y, with means $\mu_X$ and $\mu_Y$ and
standard deviations $\sigma_X$ and $\sigma_Y$. The population correlation coefficient is defined as


$$\rho_{X,Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$


We can infer two important features of this measure:


• It ranges from -1 (negatively correlated) to +1 (positively
correlated), which follows from the Cauchy-Schwarz
inequality.


• It is defined only when both standard deviations are finite and
non-zero.


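Similarly, for a sample from the population, the measure is defined as follows:


$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$


where $\bar{x}$ and $\bar{y}$ are the sample means.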
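As a quick check of the sample definition, the following sketch computes the coefficient term by term on simulated data (the data and seed are assumptions for illustration) and compares it with R's built-in cor().

```r
set.seed(42)                          # arbitrary seed for reproducibility
x <- rnorm(100, mean = 50, sd = 10)   # hypothetical predictor
y <- 2 * x + rnorm(100, sd = 15)      # hypothetical response with noise

# Sample Pearson correlation written out from its definition
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))
r_manual <- num / den

# Built-in equivalent
r_builtin <- cor(x, y, method = "pearson")
print(all.equal(r_manual, r_builtin))
```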





Let's create some scatter plots with our house price data and see what kind of 
relationship we can quantify using the Pearson correlation. 

Dependent variable: HousePrice 

Independent variable: StoreArea 


Data_HousePrice <- read.csv("Dataset/House Sale Price Dataset.csv", header=TRUE);


#Create vectors with Y-Dependent, X-Independent
y <- Data_HousePrice$HousePrice;
x <- Data_HousePrice$StoreArea;


#Scatter Plot
plot(x, y, main="Scatterplot HousePrice vs StoreArea",
xlab="StoreArea(sqft)", ylab="HousePrice($)", pch=19, cex=0.3, col="red")


#Add a fit line to show the relationship direction
abline(lm(y~x)) # regression line (y~x)
lines(lowess(x,y), col="green") # lowess line (x,y)


The plot in Figure 6-14 shows the scatter plot between HousePrice and Store Area. 


The curved line is a locally smoothed fitted line. It can be seen that there is a roughly
linear relationship between the two variables.


Figure 6-14. Scatter plot between HousePrice and StoreArea


#Report the correlation coefficient of this relation 
cat("The correlation among HousePrice and StoreArea is ",cor(x,y)); 
The correlation among HousePrice and StoreArea is 0.6212766 


From this plot, we can make the following observations:


• The relationship is in a positive direction; on average, the house
price increases with the size of the store. This is an intuitive
relationship, so a causal interpretation is plausible: a bigger store
space means a better house and hence a costlier one.


• The correlation is 0.62. This is a moderately strong relationship on
a linear scale.


• The curved line is a LOWESS (Locally Weighted Scatterplot
Smoothing) fit, which is not very different from the
linear regression line. Hence, the linear relationship is worth
exploring for a model.


• If you look closely, there is a vertical line of points at StoreArea = 0.
It tells us that prices vary among houses that have
no store area, so we need to look at other factors that are
driving the house prices.


We have discussed in detail how to find the set of variables that fit the data best. So in
the coming sections we will not focus on how we arrived at a model, but rather on how
to run and interpret the models in R.


6.5.1 Linear Regression


Linear regression is the process of fitting a linear predictor function whose unknown
parameters are estimated from the underlying data. In general, the model predicts the
conditional mean of Y given X, which is assumed to be an affine function of X.


Affine function: The estimated linear regression model has an intercept term, and hence
it is not just a linear function of X but an affine function.


Essentially, the linear regression model will help you with:

• Prediction or forecasting
• Quantifying the relationship among variables


The former answers the question: given new values of X, what is the expected value
of Y? The latter deals with how these variables were related in the historical data, in
quantifiable terms (e.g., parameters and p-values).

Mathematically, for a set of n pairs $(x_i, y_i), i = 1, \ldots, n$, the simple linear
relationship is described as:


$$y_i = \alpha + \beta x_i + \varepsilon_i$$


and the objective of linear regression is to estimate this line:

$$\hat{y} = \alpha + \beta x$$


where $\hat{y}$ is the predicted response (or fitted value); $\alpha$ is the intercept, i.e., the
average response value when the independent variable is zero; and $\beta$ is the parameter
for x, i.e., the change in y per unit change in x.
There are many ways to fit this line to a given dataset. They differ in the type
of loss function we want to minimize, e.g., ordinary least squares, least absolute deviation,
ridge, etc. Let’s look at the most popular method, Ordinary Least Squares (OLS).

In OLS, the algorithm minimizes the sum of squared errors. The minimization problem can
be defined as follows:


$$\min_{\alpha,\beta} Q(\alpha,\beta), \quad \text{for } Q(\alpha,\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$


For this parametric model with a quadratic loss, the optimization has a closed form
solution. The coefficients (or parameters) are estimated using the following equations:


$$\hat{\beta} = \frac{\sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2} = \frac{\mathrm{Cov}[x, y]}{\sigma_x^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$


where $\sigma_x^2$ is the variance of x. The derivation of this solution can be found at
https://onlinecourses.science.psu.edu/stat414/node/278.
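The closed form solution can be verified numerically. The sketch below uses simulated data (an assumption for illustration; the true intercept and slope are chosen arbitrarily) and compares the hand-computed estimates with those returned by lm().

```r
set.seed(7)
x <- runif(50, 0, 100)                 # hypothetical predictor
y <- 10 + 3 * x + rnorm(50, sd = 5)    # hypothetical response
n <- length(x)

# Closed form OLS estimates written out from the sums
beta_hat  <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Equivalent covariance form of the slope
beta_cov <- cov(x, y) / var(x)

# lm() should reproduce the same numbers
fit <- lm(y ~ x)
print(c(alpha_hat, beta_hat))
print(coef(fit))
```

Note that cov() and var() both use the n - 1 denominator, so the ratio is unaffected by the choice of denominator.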


OLS has special properties for a linear regression model under certain assumptions
on the residuals. The Gauss-Markov theorem, named after Carl Friedrich Gauss and
Andrey Markov, states that if the following conditions are satisfied:


• The expectation (mean) of the residuals is zero
• The residuals are uncorrelated (no auto-correlation)
• The residuals have equal variance (homoscedasticity)


then Ordinary Least Squares estimation gives the Best Linear Unbiased Estimator (BLUE) of
the coefficients (or parameter estimates). Note that normality of the residuals is not
required for this result. We now explain the key terms that characterize a good
estimator, i.e., bias, consistency, and efficiency.


6.5.1.2 Best Linear Predictors 


The estimation methodology used in the previous section is Ordinary Least Square (OLS). 
We want to give a statistical definition of three important terms. This is important to make 
sure that even when we use these words loosely, we are aware what these terms mean. 


• Bias of estimator: The bias of an estimator is the difference between the
estimator's expected value and the true value of the parameter
being estimated. An estimator with zero bias is called unbiased, which is
the desired property for any model.


$$\mathrm{Bias}_\theta[\hat{\theta}] = E_\theta[\hat{\theta}] - \theta = E_\theta[\hat{\theta} - \theta]$$


This equation tells us that bias is the difference between the
expected value of the estimate and the true value of the parameter. If the true
value of the parameter is 5 and the estimates from the linear model average
4.5, our estimator is biased by -0.5. This will cause consistent
underprediction (biased prediction). An estimator is unbiased if its
bias is equal to 0 for all values of the parameter $\theta$.

The bias-variance tradeoff is discussed in Chapter 8.


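• Consistent estimator: An estimator $T_n$ of parameter $\theta$ is said to
be consistent if it converges in probability to the true value of the
parameter:


$$\underset{n \to \infty}{\operatorname{plim}}\; T_n = \theta$$


Recall our discussion around CLT and LLN; when $T_n$ is approximately normal
with standard deviation $\sigma/\sqrt{n}$, we can rephrase this relationship for
any $\varepsilon > 0$ as:


$$\Pr\left[\,|T_n - \theta| \ge \varepsilon\,\right] = \Pr\left[\frac{\sqrt{n}\,|T_n - \theta|}{\sigma} \ge \frac{\sqrt{n}\,\varepsilon}{\sigma}\right] = 2\left(1 - \Phi\left(\frac{\sqrt{n}\,\varepsilon}{\sigma}\right)\right) \to 0$$


As n tends to infinity, the probability of the parameter
estimate differing from the true value by more than any fixed margin goes to zero.
This means that as we increase the training dataset, we expect
the parameter estimate to converge to the true value of the
parameter.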


• Efficient estimator: An efficient estimator is one that
estimates the parameter in the best manner with respect to
some loss/cost function. Generally, for the OLS framework, we
say an unbiased estimator is efficient if its variance attains the lower
bound given by the Cramér-Rao inequality:


$$\mathrm{Var}[T] \ge I_\theta^{-1}$$


where $I_\theta$ is the Fisher information matrix of the model at point $\theta$.

In regression, we want to minimize the variance (standard error) of our coefficient
(parameter) estimates, and OLS provides the minimum-variance estimator among linear
unbiased estimators if the conditions of the Gauss-Markov theorem are satisfied.
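Unbiasedness and consistency can be illustrated with a small simulation. This is only a sketch under assumed settings (a true slope of 2, normal noise, and 500 repetitions per sample size are arbitrary choices); it repeatedly fits OLS on fresh samples and looks at the spread of the slope estimates.

```r
set.seed(123)
true_beta <- 2   # assumed true slope for the simulation

# Simulate one dataset of size n and return the OLS slope estimate
ols_slope <- function(n) {
  x <- runif(n, 0, 10)
  y <- 1 + true_beta * x + rnorm(n, sd = 3)
  unname(coef(lm(y ~ x))[2])
}

slopes_small <- replicate(500, ols_slope(20))    # small samples
slopes_large <- replicate(500, ols_slope(500))   # large samples

# Unbiasedness: the average estimate sits close to the true value
bias_small <- mean(slopes_small) - true_beta

# Consistency: the estimates concentrate around the truth as n grows
sd_small <- sd(slopes_small)
sd_large <- sd(slopes_large)
cat("Bias (n=20):", bias_small, " SD (n=20):", sd_small, " SD (n=500):", sd_large, "\n")
```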

These three concepts can be extended to other modeling forms. They are important
properties for our estimators to have if the model is to be statistically sound. We will
now show the results for some of the important assumptions/properties we test for linear
regression models.


6.5.2 Simple Linear Regression


Now we can move on to estimating the linear models using the OLS technique. The lm()
function in R provides us with the capability to run OLS for linear regressions. The lm()
function can be used to carry out regression, single stratum analysis of variance, and
analysis of covariance. It is part of the base stats package in R.

Now, we will create a simple linear regression and understand how to interpret the
lm() output for this simple case. Here we are fitting a linear regression model with the OLS
technique on the following.

Dependent variable: HousePrice 

Independent variable: StoreArea 

Further our correlation analysis showed that these two variables have a positive 
linear relation and hence we will expect a positive sign to the parameter estimates of 
StoreArea. Let’s run and interpret the results. 


# fit the model
fitted_Model <- lm(y~x)

# Display the summary of the model
summary(fitted_Model)


Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-280115  -33717   -4689   24611  490698


Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 70677.227   4261.027   16.59   <2e-16 ***
x             232.723      8.147   28.57   <2e-16 ***


Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 63490 on 1298 degrees of freedom
Multiple R-squared: 0.386, Adjusted R-squared: 0.3855
F-statistic: 816 on 1 and 1298 DF, p-value: < 2.2e-16


The estimated equation for our example is 


y = 70677.227 + (232.723) x 


where y is HousePrice and x is StoreArea. This implies that for a unit increase in x
(StoreArea), y (HousePrice) increases by $232.72 on average. The intercept, being constant,
tells us the HousePrice when there is no StoreArea, and can be thought of as a
registration fee.

Let's discuss the lm() summary output to understand the model better. The same
explanation can be extended to multiple linear regression in the next section.


• Call: Outputs the model equation that was fitted in the lm()
function.


• Residuals: This gives the minimum, first quartile (1Q), median,
third quartile (3Q), and maximum of the residuals. A negative median means that
at least half of the residuals are negative, i.e., the predicted
values are higher than the actual values for at least 50% of the
predictions.


• Coefficients: This is a table giving the model parameter estimates (or
coefficients), standard errors, t-values, and p-values of the Student's
t-test.


• Diagnostics: Residual standard error, multiple and adjusted
R-squared, and the F-statistic for variance testing.


It is important to expand on the coefficients component of the lm() summary output.
This output provides vital information about the model predictors and their coefficients:


• Estimate: The fitted value for that parameter. This value
is used directly in the model equation for prediction and to
understand the relationship with the dependent variable. For
example, in our model, the predictor variable x (store area) has
a coefficient of 232.7.


• Standard Error (Std. Error): The standard deviation of the
sampling distribution of the parameter estimate. In other words, the
estimate is the mean value of the coefficient and the standard
deviation of that distribution is the standard error. The standard error can be
calculated as:


$$SE = \frac{s}{\sqrt{n}}$$


where s is the sample standard deviation and n is the size of
the sample.


The lower the standard error relative to the estimate, the better the model
estimate.


• t-value and p-value: Student's t-test is a hypothesis test in which
the test statistic follows a t-distribution under the null hypothesis. The p-value
reported against each parameter is the p-value of a one-sample
t-test.

We test whether the value of the parameter is statistically different from zero. If we fail to
reject the null hypothesis, then we can say the respective parameter is not significant in
our model.


The t-statistic for a one-sample t-test is as follows:


$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$


where $\bar{x}$ is the sample mean, $\mu_0$ is the value under the null hypothesis, s is the
sample standard deviation, and n is the sample size. For our linear model, the t-value of x
is 28.57 (the estimate divided by its standard error, 232.723/8.147) and the p-value is ~0.
This means the estimate of x is statistically different from 0 and hence it is significant in the model.
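The t-value and p-value columns of the coefficient table can be reproduced by hand. The sketch below uses simulated data (an assumption, since the house price file is not reproduced here) and relies on the coefficient matrix returned by summary() for an lm fit.

```r
set.seed(10)
x <- runif(100, 0, 50)                    # hypothetical predictor
y <- 20 + 1.5 * x + rnorm(100, sd = 10)   # hypothetical response
fit <- lm(y ~ x)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

estimate  <- coef_table["x", "Estimate"]
std_error <- coef_table["x", "Std. Error"]

# Null hypothesis: the coefficient equals zero
t_manual <- estimate / std_error

# Two-sided p-value from the t-distribution with residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(c(t_manual = t_manual, p_manual = p_manual))
```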

Once we understand how the model looks and what the significance of each 
predictor is, we move on to see how the model fits the actual value. This is done by 
plotting actual values against predicted values: 


res <- stack(data.frame(Observed = y, Predicted = fitted(fitted_Model)))
res <- cbind(res, x = rep(x, 2))


#Plot using lattice xyplot(function)
library("lattice")
xyplot(values ~ x, data = res, group = ind, auto.key = TRUE)


The plot in Figure 6-15 shows the fitted values together with the actual values. You can
see the linear relationship predicted by our model overlaid on the scatter plot of the
original data.


Figure 6-15. Scatter plot of actual versus predicted


Now, this was a model with only one explanatory variable (StoreArea), but there are
other variables available that show significant relationships with HousePrice. The regression
framework allows us to add multiple explanatory (independent) variables to the
regression analysis. We introduce multiple linear regression in the next section.
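Before moving on, here is a minimal sketch of the prediction use of a fitted simple regression. The data is simulated to loosely resemble the house price example (the column names are reused, but the values and coefficients are assumptions), and predict() is compared with applying the estimated equation by hand.

```r
set.seed(3)
store_area  <- runif(200, 0, 1000)   # hypothetical store areas (sqft)
house_price <- 70000 + 230 * store_area + rnorm(200, sd = 20000)
sim_data <- data.frame(StoreArea = store_area, HousePrice = house_price)

fit <- lm(HousePrice ~ StoreArea, data = sim_data)

# Predict for new houses, with a 95% confidence interval for the mean response
new_houses <- data.frame(StoreArea = c(0, 250, 500))
pred <- predict(fit, newdata = new_houses, interval = "confidence")
print(pred)

# The same point predictions from the estimated equation: intercept + slope * x
manual <- unname(coef(fit)[1] + coef(fit)[2] * new_houses$StoreArea)
print(manual)
```

Passing the new values as a data frame whose column name matches the formula is what lets predict() line them up with the model.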


6.5.3 Multiple Linear Regression 


The ideas of simple linear regression can be extended to multiple independent variables.
The linear relationship in multiple linear regression then becomes


$$y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = x_i^{T}\beta + \varepsilon_i, \qquad i = 1, \ldots, n$$


For multiple regression, the matrix representation is very popular, as it makes the
matrix computations easy to explain:


$$y = X\beta + \varepsilon$$
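In matrix form, the least squares estimate solves the normal equations $(X^{T}X)\hat{\beta} = X^{T}y$. The following sketch, on simulated data with assumed coefficients, solves them directly and compares the result with lm().

```r
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 4 + 2 * x1 - 3 * x2 + rnorm(n)   # hypothetical true coefficients 4, 2, -3

# Design matrix with an explicit intercept column
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve (X'X) beta = X'y; solve() avoids forming an explicit inverse
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(y ~ x1 + x2)
print(cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit))))
```

In practice lm() uses a QR decomposition rather than the normal equations, which is numerically more stable; the direct solve is shown only to mirror the algebra.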


In our previous example, we used just one variable, StoreArea, to explain the dependent
variable. In multiple linear regression we will use StoreArea, StreetHouseFront,
BasementArea, LawnArea, Rating, and SaleType as independent variables to estimate a
linear relationship with HousePrice.

The least squares estimation function remains the same, except that there will be new
variables as predictors. To run the analysis on multiple variables we introduce one more
data cleaning step, missing value identification. We either want to impute the missing
values or leave them out of our analysis. We choose to leave them out by using the na.omit()
function in R. The following code first finds the missing cases and then removes them.


# Use lm to create a multiple linear regression
Data_lm_Model <- Data_HousePrice[, c("HOUSE_ID", "HousePrice", "StoreArea", "StreetHouseFront", "BasementArea", "LawnArea", "Rating", "SaleType")];

# The function below displays the number of missing values in each of the
# variables in the data
sapply(Data_lm_Model, function(x) sum(is.na(x)))


        HOUSE_ID       HousePrice        StoreArea StreetHouseFront
               0                0                0              231
    BasementArea         LawnArea           Rating         SaleType
               0                0                0                0


# We have preferred removing the 231 cases which correspond to missing values
# in StreetHouseFront. The na.omit function will remove the missing cases.
Data_lm_Model <- na.omit(Data_lm_Model)
rownames(Data_lm_Model) <- NULL

# Categorical variables have to be set as factors
Data_lm_Model$Rating <- factor(Data_lm_Model$Rating)
Data_lm_Model$SaleType <- factor(Data_lm_Model$SaleType)


Now we have cleaned up the data from the missing values and can run the lm()
function to fit our multiple linear regression model.

fitted_Model_multiple <- lm(HousePrice ~ StoreArea + StreetHouseFront + BasementArea + LawnArea + Rating + SaleType, data = Data_lm_Model)


summary(fitted_Model_multiple)


Call:
lm(formula = HousePrice ~ StoreArea + StreetHouseFront + BasementArea +
    LawnArea + Rating + SaleType, data = Data_lm_Model)


Residuals:
    Min      1Q  Median      3Q     Max
-485976  -19682   -2244   15690  321737

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           2.507e+04  4.827e+04   0.519  0.60352
StoreArea             5.462e+01  7.550e+00   7.234 9.06e-13 ***
StreetHouseFront      1.353e+02  6.042e+01   2.240  0.02529 *
BasementArea          2.145e+01  3.004e+00   7.140 1.74e-12 ***
LawnArea              1.026e+00  1.721e-01   5.963 3.39e-09 ***
Rating2              -8.385e+02  4.816e+04  -0.017  0.98611
Rating3               2.495e+04  4.302e+04   0.580  0.56198
Rating4               3.948e+04  4.197e+04   0.940  0.34718
Rating5               5.576e+04  4.183e+04   1.333  0.18286
Rating6               7.911e+04  4.186e+04   1.890  0.05905 .
Rating7               1.187e+05  4.193e+04   2.830  0.00474 **
Rating8               1.750e+05  4.214e+04   4.153 3.54e-05 ***
Rating9               2.482e+05  4.261e+04   5.825 7.61e-09 ***
Rating10              2.930e+05  4.369e+04   6.708 3.23e-11 ***
SaleTypeFirstResale   2.146e+04  2.470e+04   0.869  0.38512
SaleTypeFourthResale  6.725e+03  2.791e+04   0.241  0.80964
SaleTypeNewHouse      2.329e+03  2.424e+04   0.096  0.92347
SaleTypeSecondResale -5.524e+03  2.465e+04  -0.224  0.82273
SaleTypeThirdResale  -1.479e+04  2.613e+04  -0.566  0.57160

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 41660 on 1050 degrees of freedom
Multiple R-squared: 0.7644, Adjusted R-squared: 0.7604
F-statistic: 189.3 on 18 and 1050 DF, p-value: < 2.2e-16

The estimated model has six independent variables, with four continuous variables
(StoreArea, StreetHouseFront, BasementArea, and LawnArea) and two categorical
variables (Rating and SaleType). From the results of the lm() function, we can see that
StoreArea, StreetHouseFront, BasementArea, and LawnArea are significant at the 95%
confidence level, i.e., statistically different from zero. All levels of SaleType, in contrast,
are insignificant, i.e., their coefficients are not statistically distinguishable from zero.
The higher ratings are significant but not the lower ones. The model should drop
SaleType and be re-estimated to keep only significant variables.
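The coefficient names such as Rating2 or SaleTypeNewHouse arise because lm() expands each factor into dummy (indicator) variables, with the first level absorbed into the intercept as the reference. A minimal sketch with a made-up three-level factor (hypothetical values, not the book's data) shows the expansion.

```r
rating <- factor(c(1, 2, 3, 2, 1, 3))   # hypothetical categorical predictor

# model.matrix() shows the design matrix that lm() would build
dummies <- model.matrix(~ rating)
print(colnames(dummies))   # "(Intercept)" "rating2" "rating3"
print(levels(rating)[1])   # the reference level, absorbed into the intercept

# relevel() changes the reference level, and with it what each coefficient means
rating_ref3 <- relevel(rating, ref = "3")
print(colnames(model.matrix(~ rating_ref3)))
```

Each dummy coefficient is therefore a difference from the reference level, which is why the output above has no Rating1 row.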

Now we will see how the actual versus predicted values look for this model by 
plotting them after ordering the series by house prices. 


#Get the fitted values and create a data frame of actual and predicted values
actual_predicted <- as.data.frame(cbind(as.numeric(Data_lm_Model$HOUSE_ID), as.numeric(Data_lm_Model$HousePrice), as.numeric(fitted(fitted_Model_multiple))))


names(actual_predicted) <- c("HOUSE_ID", "Actual", "Predicted")


#Order the houses by increasing actual house price
actual_predicted <- actual_predicted[order(actual_predicted$Actual), ]


#Find the absolute residual and then take mean of that
library(ggplot2)


#Plot Actual vs Predicted values for Test Cases
ggplot(actual_predicted, aes(x = 1:nrow(Data_lm_Model), color = Series)) +
geom_line(data = actual_predicted, aes(x = 1:nrow(Data_lm_Model), y = Actual, color = "Actual")) +
geom_line(data = actual_predicted, aes(x = 1:nrow(Data_lm_Model), y = Predicted, color = "Predicted")) +
xlab('House Number') + ylab('House Sale Price')

The plot in Figure 6-16 shows the actual and predicted values, ordered by
HousePrice.

Figure 6-16. The actual versus predicted plot


We have arranged the HousePrices in increasing order to get a less cluttered actual
versus predicted plot. The plot shows that our model closely follows the actual prices.
There are a few cases of outlier/high actual values that the model is not able to
predict, which is acceptable as long as these outliers do not unduly influence the fit.


6.5.4 Model Diagnostics: Linear Regression 


Model diagnostics is an important step in the model-selection process. There is a
difference between model performance evaluation, discussed in Chapter 7, and the
model selection process. In model evaluation, we check how the model performs on
unseen data (testing data), but in model diagnostics/selection, we see how the model
fit itself looks on our data. This includes checking the p-value significance of the
parameter estimates, normality, auto-correlation, homoscedasticity, influential/outlier
points, and multicollinearity. There are other tests as well to see how well the model
follows the statistical assumptions, such as strict exogeneity and ANOVA tables, but we
will focus on only a few in the following sections.


6.5.4.1 Influential Point Analysis


In linear regression, extreme values can create issues in the estimation process. A few
high-leverage values can introduce bias in the estimators and create other aberrations in the
residuals. So it is important to identify influential points in the data. If the influential points
seem too extreme, we may have to discard them from our analysis as outliers.

A specific statistical measure that we will show, among others, is Cook’s distance.
This measure estimates the influence of a data point when doing an OLS
estimation.

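Cook’s distance is defined as follows:


$$D_i = \frac{e_i^2}{s^2 p}\left[\frac{h_i}{(1 - h_i)^2}\right]$$


where $s^2 = (n - p)^{-1} e^{T} e$ is the mean squared error of the regression model, p is the
number of estimated parameters, $h_i = x_i^{T}(X^{T}X)^{-1}x_i$ is the leverage of the i-th
observation, and the i-th element of the residual vector $e = y - \hat{y} = (I - H)y$ is denoted by $e_i$.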


In simple terms, Cook's distance measures the effect of deleting a given observation. 
In this way, if removal of some observation causes significant changes, that means those 
points are influencing the regression model. These points are assigned a large value to 
Cook's distance and are considered for further investigation. 

The cutoff value for this statistic can be taken as $D_i > 4/n$, where n is the number of
observations. If you adjust for the number of parameters in the model, then the cutoff can
be taken as $D_i > 4/(n - k - 1)$, where k is the number of variables in the model.
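The definition and the cutoff can be checked with base R alone. This sketch builds a small simulated dataset with one deliberately planted high-leverage point (all values are assumptions for illustration), computes Cook's distance from the formula, and compares it with cooks.distance().

```r
set.seed(11)
x <- c(runif(30, 0, 10), 30)                # last point has an extreme x value
y <- c(2 + 0.5 * x[1:30] + rnorm(30), 40)   # ...and does not follow the trend
fit <- lm(y ~ x)

e  <- residuals(fit)                 # e_i
h  <- hatvalues(fit)                 # h_i, diagonal of the hat matrix
p  <- length(coef(fit))              # number of estimated parameters
s2 <- sum(e^2) / df.residual(fit)    # mean squared error

# Cook's distance from its definition
D_manual  <- (e^2 / (p * s2)) * (h / (1 - h)^2)
D_builtin <- cooks.distance(fit)

# Flag observations above the 4/n rule of thumb
n <- length(y)
flagged <- which(D_builtin > 4 / n)
print(flagged)
```

The planted observation (number 31) should appear among the flagged points, since it combines high leverage with a large residual.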


library(car);
# Influential Observations
# Cook's D plot
# identify D values > 4/(n-k-1); length(coefficients) equals k+1 because it includes the intercept
cutoff <- 4/(nrow(Data_lm_Model) - length(fitted_Model_multiple$coefficients))
plot(fitted_Model_multiple, which=4, cook.levels=cutoff)

The plot in Figure 6-17 shows the Cook’s distance for each observation in our
dataset.


Figure 6-17. Cook’s distance for each observation


You can see the observation numbers with a high Cook’s distance are highlighted in 
the plot in Figure 6-17. These observations require further investigation. 


# Influence Plot
influencePlot(fitted_Model_multiple, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance", id.location=FALSE)

The plot in Figure 6-18 shows a different view of Cook’s distance. The circle size is
proportional to the Cook’s distance.


[Figure: influence plot with Hat-Values on the horizontal axis; circle size is proportional to Cook's distance]


Figure 6-18. Influence plot


Also, the outlier test results are shown here: 


#Outlier Test (called outlier.test() in older versions of the car package)
outlierTest(fitted_Model_multiple)
       rstudent unadjusted p-value Bonferonni p
621  -14.285067         1.9651e-42   2.0987e-39
229    8.259067         4.3857e-16   4.6839e-13
564   -7.985171         3.6674e-15   3.9168e-12
1023   7.902970         6.8545e-15   7.3206e-12
718    5.040489         5.4665e-07   5.8382e-04
799    4.925227         9.7837e-07   1.0449e-03
235    4.916172         1.0236e-06   1.0932e-03
487    4.673321         3.3491e-06   3.5768e-03
530    4.479709         8.2943e-06   8.8583e-03


The observation numbers 342, 621, and 1023, as shown in Figure 6-18
(corresponding to HOUSE_ID 412, 759, and 1242), are the three main influence points.


Let's pull out these records to see what values they have.

#Pull the records with highest leverage
Debug <- Data_lm_Model[c(342,621,1023), ]


print(" The observed values for three high leverage points"); 
[1] " The observed values for three high leverage points" 


Debug
     HOUSE_ID HousePrice StoreArea StreetHouseFront BasementArea LawnArea
342       412     375000       513              150         1236   215245
621       759     160000      1418              313         5644    63887
1023     1242     745000       813              160         2096    15623
     Rating     SaleType
342       7     NewHouse
621      10  FirstResale
1023     10 SecondResale


print("Model fitted values for these high leverage points"); 
[1] "Model fitted values for these high leverage points" 


fitted_Model_multiple$fitted.values[c(342,621,1023)]
     342      621     1023
441743.2 645975.9 439634.3


print("Summary of Observed values"); 
[1] "Summary of Observed values" 


summary(Debug)
    HOUSE_ID        HousePrice       StoreArea      StreetHouseFront
 Min.   : 412.0   Min.   :160000   Min.   : 513.0   Min.   :150.0
 1st Qu.: 585.5   1st Qu.:267500   1st Qu.: 663.0   1st Qu.:155.0
 Median : 759.0   Median :375000   Median : 813.0   Median :160.0
 Mean   : 804.3   Mean   :426667   Mean   : 914.7   Mean   :207.7
 3rd Qu.:1000.5   3rd Qu.:560000   3rd Qu.:1115.5   3rd Qu.:236.5
 Max.   :1242.0   Max.   :745000   Max.   :1418.0   Max.   :313.0

  BasementArea     LawnArea        Rating        SaleType
 Min.   :1236   Min.   : 15623   10     :2   FifthResale :0
 1st Qu.:1666   1st Qu.: 39755   7      :1   FirstResale :1
 Median :2096   Median : 63887   1      :0   FourthResale:0
 Mean   :2992   Mean   : 98252   2      :0   NewHouse    :1
 3rd Qu.:3870   3rd Qu.:139566   3      :0   SecondResale:1
 Max.   :5644   Max.   :215245   4      :0   ThirdResale :0
                                 (Other):0


Note that the house prices for these three leverage points are far away from the
mean, where most observations are concentrated. The house prices for two of the
observations correspond to the highest and lowest in the dataset, and in both cases the
fitted value is far from the observed price. Another interesting point is that the third
observation, corresponding to the median house price, has a very high lawn area, which
makes it an influence point. Based on this analysis, we can either go back and check
whether these are data errors or choose to ignore them in our analysis.


6.5.4.2 Normality of Residuals 


Residuals are central to the diagnostics of regression models. Normality of the residuals
is an important condition for the model to be a valid linear regression model. In simple
words, normality implies that the errors/residuals are random noise and that our model has
captured all the signal in the data.

The linear regression model gives us the conditional expectation of Y for given values
of X. However, the fitted equation leaves some residual. We need the residuals to be
normally distributed with a mean of 0. Normal residuals mean that the model inference
(confidence intervals, significance of model predictors) is valid.

The distribution of studentized residuals (which can be thought of as normalized values)
is a good way to see whether the normality assumption holds. But we may still want to
formally test the residuals with normality tests such as the KS test, the Shapiro-Wilk
test, the Anderson-Darling test, etc.

Here, we show a histogram of the studentized residuals with a standard normal curve
overlaid. For normal residuals, the histogram should follow the bell curve.


library(stats)
library(IDPmisc)
Loading required package: grid
library(MASS)
sresid <- studres(fitted_Model_multiple)
#Remove irregular values (NAN/Inf/NAs)
sresid <- NaRV.omit(sresid)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals", breaks=25)
xfit <- seq(min(sresid), max(sresid), length=40)
yfit <- dnorm(xfit)
lines(xfit, yfit)

The plot in Figure 6-19 is created using the studentized residuals. In the previous
code, the residuals are studentized using the studres() function in R.


Figure 6-19. Distribution of studentized residuals


The residual plot is close to a normal plot, as the distribution forms a bell curve.
However, we still want to test normality formally. We will show the results of all
three normality tests, but will formally introduce the test statistic only for the most
popular one, the one-sample Kolmogorov-Smirnov test, or KS test. For the rest of the
tests, we encourage you to go through the R documentation for the functions used here,
which points to the most appropriate references on the topic.

The Kolmogorov-Smirnov statistic for a given cumulative distribution function F(x) is

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

where F_n(x) is the empirical distribution function of the sample and sup_x is the
supremum of the set of distances.

The KS statistic gives back the largest difference between the empirical distribution
of the residuals and the normal distribution. If this largest difference (supremum) is
more than a critical value, then we say the distribution is not normal (using the p-value
of the test statistic). Here we have three tests for conformity of results:


# test on normality
#K-S one sample test
ks.test(fitted_Model_multiple$residuals, pnorm, alternative="two.sided")

	One-sample Kolmogorov-Smirnov test

data:  fitted_Model_multiple$residuals
D = 0.54443, p-value < 2.2e-16
alternative hypothesis: two-sided

#Shapiro Wilk Test
shapiro.test(fitted_Model_multiple$residuals)

	Shapiro-Wilk normality test

data:  fitted_Model_multiple$residuals
W = 0.80444, p-value < 2.2e-16

#Anderson Darling Test
library(nortest)
ad.test(fitted_Model_multiple$residuals)

	Anderson-Darling normality test

data:  fitted_Model_multiple$residuals
A = 29.325, p-value < 2.2e-16


None of these three tests indicates that the residuals are normally distributed. The
p-values are less than 0.05, and hence we can reject the null hypothesis that the
distribution is normal. This means we have to go back to our model and see what might
be driving the non-normal behavior: missing or redundant variables, influential points,
or other issues.
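
One caveat is worth checking, sketched here on simulated data rather than on the
chapter's model. With its default arguments, ks.test(x, pnorm) compares x against the
standard normal N(0, 1), so residuals on their original scale can be rejected because of
their spread alone. The sketch below standardizes the values first and also computes the
statistic by hand to show what it measures. The simulated residuals and all variable
names are assumptions for illustration. Note that estimating the mean and standard
deviation from the same sample makes the reported p-value only approximate.

```r
set.seed(42)
# Simulated stand-in for model residuals: normal, but with a large spread
res <- rnorm(200, mean = 0, sd = 50000)

# Comparing raw values with N(0, 1) mostly measures the difference in scale
ks_raw <- ks.test(res, "pnorm")

# Standardize first so the comparison is about shape, not scale
z <- (res - mean(res)) / sd(res)
ks_std <- ks.test(z, "pnorm")

# D_n by hand: largest gap between the empirical CDF and the normal CDF
n  <- length(z)
zs <- sort(z)
d_manual <- max(c((1:n) / n - pnorm(zs), pnorm(zs) - (0:(n - 1)) / n))

print(ks_raw$statistic)
print(ks_std$statistic)
print(d_manual)
```

The hand-computed value agrees with the statistic reported for the standardized values,
while the statistic for the raw values is much larger purely because of the scale.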


6.5.4.3 Multicollinearity 


Multicollinearity is basically a problem of redundant information among the independent
variables. It occurs when two or more independent variables are highly correlated, which
inflates the standard errors in the model fit. To check for this phenomenon, we can look
at the correlation matrix and see whether any pair of variables has a strong relationship.
If so, including one of the two variables is enough to supply the information required to
explain the dependent variable.

For this section, we will use the variance inflation factor to determine the degree
of multicollinearity in the independent variables. Another popular method for detecting
multicollinearity is the condition index (condition number).

The variance inflation factor (VIF) for multicollinearity is defined as follows:

$$\text{tolerance}_j = 1 - R_j^2, \qquad \text{VIF}_j = \frac{1}{\text{tolerance}_j}$$

where R_j^2 is the coefficient of determination of a regression of explanator j on all the
other explanators.

Generally, cutoffs for detecting the presence of multicollinearity based on the
metrics are:

• Tolerance less than 0.20
• VIF of 5 and greater indicating a multicollinearity problem

The simple solution to this problem is to drop the variables that breach these thresholds
from the model building process.
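
To make the definition concrete, the following sketch computes the VIF and tolerance by
hand, by regressing each predictor on the remaining ones. It uses three columns of the
built-in mtcars dataset as stand-in predictors; the dataset, the column choice, and the
variable names are assumptions for illustration and are not part of the house price model.

```r
# Illustrative only: mtcars predictors stand in for the housing variables
predictors <- mtcars[, c("wt", "hp", "disp")]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest
vif_manual <- sapply(names(predictors), function(j) {
  others <- setdiff(names(predictors), j)
  fit_j  <- lm(reformulate(others, response = j), data = predictors)
  1 / (1 - summary(fit_j)$r.squared)
})

# Tolerance is simply the reciprocal of the VIF
tolerance <- 1 / vif_manual

print(round(vif_manual, 3))
print(round(tolerance, 3))
```

Each value can then be compared with the cutoffs listed above.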


library(car)
# calculate the vif factor
# Evaluate Collinearity
print(" Variance inflation factors are ");
[1] " Variance inflation factors are "
vif(fitted_Model_multiple);
# variance inflation factors
                     GVIF Df GVIF^(1/(2*Df))
StoreArea        1.767064  1        1.329309
StreetHouseFront 1.359812  1        1.166110
BasementArea     1.245537  1        1.116036
LawnArea         1.254520  1        1.120054
Rating           1.931826  9        1.037259
SaleType         1.259122  5        1.023309

print("Tolerance factors are ");
[1] "Tolerance factors are "
1/vif(fitted_Model_multiple)
                      GVIF        Df GVIF^(1/(2*Df))
StoreArea        0.5659106 1.0000000       0.7522703
StreetHouseFront 0.7353955 1.0000000       0.8575521
BasementArea     0.8028664 1.0000000       0.8960281
LawnArea         0.7971175 1.0000000       0.8928143
Rating           0.5176450 0.1111111       0.9640796
SaleType         0.7942043 0.2000000       0.9772220

Now we have the VIF values and tolerance values in the previous tables. We will
simply apply the cutoffs for VIF and tolerance as discussed. Note that the code uses a
slightly stricter VIF cutoff of 4.

# Apply the cut-off to Vif
print("Apply the cut-off of 4 for vif")
[1] "Apply the cut-off of 4 for vif"
vif(fitted_Model_multiple) > 4
                  GVIF    Df GVIF^(1/(2*Df))
StoreArea        FALSE FALSE           FALSE
StreetHouseFront FALSE FALSE           FALSE
BasementArea     FALSE FALSE           FALSE
LawnArea         FALSE FALSE           FALSE
Rating           FALSE  TRUE           FALSE
SaleType         FALSE  TRUE           FALSE

# Apply the cut-off to Tolerance
print("Apply the cut-off of 0.2 for vif")
[1] "Apply the cut-off of 0.2 for vif"
(1/vif(fitted_Model_multiple)) < 0.2
                  GVIF    Df GVIF^(1/(2*Df))
StoreArea        FALSE FALSE           FALSE
StreetHouseFront FALSE FALSE           FALSE
BasementArea     FALSE FALSE           FALSE
LawnArea         FALSE FALSE           FALSE
Rating           FALSE  TRUE           FALSE
SaleType         FALSE FALSE           FALSE


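You can observe that the GVIF column is FALSE for every variable under the cutoffs we set
for multicollinearity. (The TRUE values in the Df column only reflect the degrees of
freedom of the categorical variables, not collinearity.) Hence, we can safely say that our
model does not have a multicollinearity problem. The standard errors are therefore not
inflated, and we can proceed with hypothesis testing.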


6.5.4.4 Residual Autocorrelation 


Correlation is defined among two different variables, while autocorrelation, also known 
as serial correlation, is the correlation of a variable with itself at different points in time 
or in a series. This type of relationship is very important and quite frequently used in 
time series modeling. Auto-correlation makes more sense when we have an inherent 
order in the observations, e.g., index by time, key, etc. If the residual shows that it has a 
definite relationship with prior residuals, i.e. auto-correlated, the noise is not purely by 
chance, which means we still have some more information that we can extract and put 
in the model. 

To test for auto-correlation we will use the most popular method, the Durbin-Watson
test.

Given that the process has a defined mean and variance, the auto-correlation between
times s and t can be defined as follows:

$$R(s,t) = \frac{E\left[(X_t - \mu_t)(X_s - \mu_s)\right]}{\sigma_t \sigma_s}$$

This can be rewritten for our residual auto-correlation as the Durbin-Watson test
statistic d:

$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$

where e_t is the residual associated with the observation at time t and T is the number
of observations.

To interpret the statistic, you can follow these rules, where d_L and d_U are the lower
and upper critical bounds:

• 0 to d_L: significant positive autocorrelation
• d_L to d_U: no decision
• d_U to 4 - d_U: no significant autocorrelation
• 4 - d_U to 4 - d_L: no decision
• 4 - d_L to 4: significant negative autocorrelation


Figure 6-20. Durbin Watson statistics bounds


Positive auto-correlation means that a positive error for one observation increases the
chances of a positive error for the next observation, while negative auto-correlation
means that a positive error tends to be followed by a negative one. Neither positive nor
negative auto-correlation is desired in linear regression models. From Figure 6-20, it is
clear that if the d-statistic is close to 2, we can infer that there is no
auto-correlation in the residual terms.

Another way to detect auto-correlation is by plotting the ACF and searching for
spikes.
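
The d-statistic is simple enough to compute by hand, which helps in seeing why values
near 2 indicate no auto-correlation. The sketch below uses simulated data with
independent errors; the data and variable names are assumptions for illustration, not the
chapter's model. It also shows the usual approximation d ≈ 2(1 - r1), where r1 is the
lag-1 auto-correlation of the residuals.

```r
set.seed(1)
# Simulated data with independent errors (an assumption for illustration)
x <- 1:100
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
e <- residuals(fit)

# Durbin-Watson statistic: squared successive differences over the sum of squares
d <- sum(diff(e)^2) / sum(e^2)

# Lag-1 auto-correlation of the residuals; d is approximately 2 * (1 - r1)
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

print(d)
print(2 * (1 - r1))
```

When r1 is close to 0 the statistic is close to 2, r1 near 1 pushes it toward 0, and r1
near -1 pushes it toward 4, which matches the bounds shown in Figure 6-20.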


# Test for Autocorrelated Errors
durbinWatsonTest(fitted_Model_multiple)
 lag Autocorrelation D-W Statistic p-value
   1     -0.03814535      2.076011   0.192
 Alternative hypothesis: rho != 0
#ACF Plots
plot(acf(fitted_Model_multiple$residuals))

The plot in Figure 6-21 is called an Auto-Correlation Function (ACF) plot against
different lags. This plot is popular in time series analysis, where the data is indexed by
time, so we are using this plot here as a proxy for an auto-correlation explanation.


Figure 6-21. Auto-correlation function (ACF) plot


The Durbin-Watson statistic shows no auto-correlation among the residuals, with d
equal to 2.07. The ACF plot also does not show spikes. Hence, we can say the residuals
are free from auto-correlation.


6.5.4.5 Homoscedasticity 


Homoscedasticity means all the random variables in the sequence or vector have 
finite and constant variance. This is also called homogeneity of variance. In the linear 
regression framework, homoscedastic errors/residuals will mean that the variance of 
errors is independent of the values of x. This means the probability distribution of y has 
the same standard deviation regardless of x. 

There are multiple statistical tests for checking the homoscedasticity assumption,
e.g., the Breusch-Pagan test, the ARCH test, Bartlett's test, and so on. In this section
our focus is on Bartlett's test, introduced by Bartlett in 1937 and described in Snedecor
and Cochran (1989).

To perform Bartlett’s test, first we create subgroups within our population data.
For illustration we have created three groups of population data with 400, 400, and 269
observations, respectively.

# We can create three groups in data to see if the variance varies across
# these three groups
gp <- numeric()

for (i in 1:1069)
{
  if (i <= 400) {
    gp[i] <- 1;
  } else if (i <= 800) {
    gp[i] <- 2;
  } else {
    gp[i] <- 3;
  }
}


Now we define the hypotheses we will be testing in Bartlett’s test:


H0: All three population variances are the same.
Ha: At least two are different.


Here, we perform Bartlett’s test with the function bartlett.test():


Data_lm_Model$gp <- factor(gp)
bartlett.test(fitted_Model_multiple$fitted.values, Data_lm_Model$gp)

	Bartlett test of homogeneity of variances

data:  fitted_Model_multiple$fitted.values and Data_lm_Model$gp
Bartlett's K-squared = 1.3052, df = 2, p-value = 0.5207


The Bartlett test has a p-value greater than 0.05, which means we fail to reject
the null hypothesis. There is no evidence that the subgroups have different variances,
and hence we treat the variance as homoscedastic.
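
The call above passes the fitted values to bartlett.test(). Since homoscedasticity is a
statement about the errors, a reader may also want to run the same test on the residuals.
The following is a minimal sketch on simulated data with constant error variance; the
data, the group sizes, and the variable names are assumptions for illustration, not the
chapter's dataset.

```r
set.seed(123)
# Simulated regression data with constant error variance
n <- 300
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

# Three groups of consecutive observations, built without an explicit loop
gp <- factor(rep(1:3, each = 100))

# Bartlett's test on the residuals across the three groups
bt <- bartlett.test(residuals(fit), gp)
print(bt)
```

For unequal group sizes such as the 400/400/269 split used above, a grouping factor can
be built in one line with cut(1:1069, breaks = c(0, 400, 800, 1069), labels = 1:3)
instead of a loop.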

Here, we show some more tests for checking variances. This is done for reference
purposes so that you can replicate other tests if required.


1. Breusch-Pagan Test


# non-constant error variance test - Breusch-Pagan test
ncvTest(fitted_Model_multiple)

Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 2322.866    Df = 1     p = 0


These results are for a popular test for heteroscedasticity
called the Breusch-Pagan test. Its null hypothesis is that the
error variance is constant. The p-value is 0, hence this test
rejects the null of constant variance, in contrast to the
result of Bartlett's test above.

2. ARCH Test


#also show ARCH test - More relevant for a time series model
library(FinTS)
ArchTest(fitted_Model_multiple$residuals)

	ARCH LM-test; Null hypothesis: no ARCH effects

data:  fitted_Model_multiple$residuals
Chi-squared = 4.2168, df = 12, p-value = 0.9792


The results of the Bartlett test and the ARCH test do not reject
homoscedasticity of the residuals. The plot in Figure 6-22
is a residual versus fitted values plot. It is a scatter plot of
residuals on the x axis and fitted values (estimated responses)
on the y axis. The plot is used to detect non-linearity, unequal
error variances, and outliers.


# plot residuals vs. fitted values
plot(fitted_Model_multiple$residuals, fitted_Model_multiple$fitted.values)

Figure 6-22. Residuals versus fitted plot


A plot of fitted values and residuals also does not show any increasing or
decreasing pattern. This suggests that the residuals are homoscedastic, as they don’t
vary with the values of x.

In this section on model diagnostics, we explored the important tests and processes
used to identify problems with regression. Influential points can bring bias into the
model estimates and reduce the performance of a model. We can explore a few options to
reduce the problem: capping the values, creating bins, or simply removing the points from
the analysis. Normality of residuals is important because we expect a good model to
capture all signals in the data and reduce the residuals to just white noise.
Auto-correlation is a feature of indexed data; if the residuals are not independent of
each other and show auto-correlation, the model performance will be reduced.
Homoscedasticity is another important diagnostic that tells us whether the variance of
the errors is independent of the predictor/independent variables. All these diagnostics
need to be done after fitting a regression model to make sure the model is reliable and
statistically valid to be used in real settings.

Now that we have gone through the major diagnostic tests for linear regression, we can
move on to polynomial regression. So far we have assumed that the relationship between
the dependent and independent variables is linear, but this may not be the case in real
life. A linear relationship implies the same change in the dependent variable per unit
change in the independent variable at all levels. For example, the increase in HousePrice
when the store size changes from 10 sq. ft. to 20 sq. ft. is not the same as for a change
of the same 10 sq. ft. from 2000 sq. ft. to 2010 sq. ft. But linear regression ignores
this fact and assumes the same change at all levels.

The next section will extend the idea of linear regression to relationships with higher
degree polynomials.


6.5.5 Polynomial Regression 


The linear regression framework can be extended to polynomial relationships between
variables. In polynomial regression, the relationship between the independent variable x
and the dependent variable y is modeled as an mth degree polynomial.

The polynomial regression model can be presented as follows:

$$y_i = a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_m x_i^m + \varepsilon_i \quad (i = 1, 2, \ldots, n)$$

There are multiple examples where the data does not follow a linear relationship but a
higher degree one. In general, real-life relationships are not strictly linear.
Linear regression assumes that the dependent variable can move in only one direction, with
the same marginal change per unit of the independent variable.

For instance, HousePrice has a positive correlation with StoreArea. This means that if
the StoreArea increases, the HousePrice will increase. So if StoreArea keeps on increasing,
the HousePrice will increase at the same rate (coefficient). But do you believe
that a HousePrice can go to 1 million if the StoreArea is too big? No, StoreArea has a
utility that keeps on decreasing as it increases, and eventually you will not see the same
increase in HousePrice.

Economics provides a lot of good examples of such quadratic behavior, e.g., price
elasticity, diminishing returns, etc. Also, in everyday planning we make use of quadratic
and other higher degree polynomial relationships, for example in discount generation and
product pricing. We will show an example of how polynomial regression can help model such
a polynomial relationship.

Dependent variable: Price of a commodity

Independent variable: Quantity Sold

The general principle is that if the price is too low, people will not buy the commodity,
thinking it’s not of good quality, but if the price is too high, people will not buy due
to cost considerations. Let's try to quantify this relationship using linear and quadratic
regression.


#Dependent variable : Price of a commodity
y <- as.numeric(c("3.3","2.8","2.9","2.3","2.6","2.1","2.5","2.9","2.4",
"3.0","3.1","2.8","3.3","3.5","3"));

#Independent variable : Quantity Sold
x <- as.numeric(c("50","55","49","68","73","71","80","84","79","92","91","90",
"110","103","99"));

#Plot Linear relationship
linear_reg <- lm(y~x)
summary(linear_reg)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.66844 -0.25994  0.03346  0.20895  0.69004

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.232652   0.445995   5.006  0.00024 ***
x           0.007546   0.005463   1.381  0.19046
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3836 on 13 degrees of freedom
Multiple R-squared:  0.128,	Adjusted R-squared:  0.06091
F-statistic: 1.908 on 1 and 13 DF,  p-value: 0.1905


The model summary shows that the multiple R-squared is merely 12.8% and that the
variable x is not significant in the model: the p-value of its coefficient is 0.19.
Figure 6-23 shows the actual versus predicted scatter plot to see whether
the values are fitted well or not.

res <- stack(data.frame(Observed = as.numeric(y), Predicted = fitted(linear_reg)))
res <- cbind(res, x = rep(x, 2))

#Plot using lattice xyplot(function)
library("lattice")
xyplot(values ~ x, data = res, group = ind, auto.key = TRUE)


Figure 6-23. Actual versus predicted plot linear model


The plot provides additional evidence that a linear relationship does not hold. The
observed values do not fit well along the straight line predicted by the model.

Now, we move on to fitting a quadratic curve to our data, to see if that helps us
capture the curvilinear relationship between price and quantity.

#Plot Quadratic relationship
linear_reg <- lm(y~x + I(x^2))
summary(linear_reg)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max
-0.43380 -0.13005  0.00493  0.20701  0.33776

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.8737010  1.1648621   5.901 7.24e-05 ***
x           -0.1189525  0.0309061  -3.849  0.00232 **
I(x^2)       0.0008145  0.0001976   4.122  0.00142 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2569 on 12 degrees of freedom
Multiple R-squared:  0.6391,	Adjusted R-squared:  0.5789
F-statistic: 10.62 on 2 and 12 DF,  p-value: 0.002211


The model summary shows that the multiple R-squared has improved to about 64% after we
introduced a quadratic term for x, and both x and x-squared are statistically significant
in the model. Let’s plot the scatter plot and see if the values fit the data well or not.

res <- stack(data.frame(Observed = as.numeric(y), Predicted = fitted(linear_reg)))
res <- cbind(res, x = rep(x, 2))

#Plot using lattice xyplot(function)
library("lattice")
xyplot(values ~ x, data = res, group = ind, auto.key = TRUE)


Figure 6-24. Actual versus predicted plot quadratic polynomial model

The model shows an improvement in R-squared, and the quadratic term is significant in
the model. The plot also shows a better fit in the quadratic case than in the linear case.
The idea can be extended to higher degree polynomials, but that tends to cause
overfitting. Also, many processes are not well represented by very high degree
polynomials. If you are planning to use a polynomial of degree higher than four, be very
careful with the interpretation.
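
One way to guard against adding polynomial terms that do not earn their place is to
compare the nested models formally. The sketch below assumes the fifteen price and
quantity values from the example above and uses anova() for an F-test between the linear
and quadratic fits. The object names fit_linear and fit_quadratic are illustrative and do
not appear in the original listing.

```r
# Price and quantity values as used in the example above
y <- c(3.3, 2.8, 2.9, 2.3, 2.6, 2.1, 2.5, 2.9, 2.4, 3.0, 3.1, 2.8, 3.3, 3.5, 3.0)
x <- c(50, 55, 49, 68, 73, 71, 80, 84, 79, 92, 91, 90, 110, 103, 99)

fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ x + I(x^2))

# F-test for the nested models: does the squared term reduce residual error?
print(anova(fit_linear, fit_quadratic))

# Adjusted R-squared penalizes the extra term, so it is a fairer comparison
adj_r2 <- c(linear    = summary(fit_linear)$adj.r.squared,
            quadratic = summary(fit_quadratic)$adj.r.squared)
print(round(adj_r2, 4))
```

The same comparison can be repeated each time a higher degree term is added; a term that
does not produce a significant F-test is a candidate for being dropped.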


6.5.6 Logistic Regression 


In linear regression we have seen that the dependent variable is a continuous variable
taking real values. Also, we have determined that the errors are required to be normal for
the regression inference to be valid. Now let’s consider what happens if the dependent
variable has only two possible values (0 and 1), in other words is binomially
distributed. Then the error terms cannot be normally distributed, as:

$$e_i = \text{Binomial}(Y_i) - \text{Gaussian}(\beta_0 + \beta_1 x_i)$$

Hence, we need to move to a different framework to accommodate the cases where
the dependent variable is not Gaussian but comes from an exponential family of
distributions. After logistic regression we will touch on exponential family distributions
and show how they can be reduced to a linear form by a link function. Logistic regression
models a relationship between predictor variables and a categorical response/dependent
variable. An example is the credit risk problem we looked at in Chapter 5, where the
predictor variables were used to model the binomial outcome of Default/No Default.

Logistic regression can be of three types based on the type of categorical (response)
variable:


• Binomial Logistic Regression: Only two possible values for the
response variable (0/1). Typically we estimate the probability of it
being 1 and, based on some cutoff, we predict the state of the response
variable.


Binomial distribution probability mass function is given by

$$f(k; n, p) = \Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

where k is the number of successes, n is the total number of trials, and p is the
probability of success in a single trial.


• Multinomial Logistic Regression: There are three or more
values/levels for the categorical response variable. Typically,
we calculate the probability for each level and then, based on some
classification rule (e.g., maximum probability), we assign the state of the
response variable.

Multinomial distribution probability mass function is given by

$$f(x_1, \ldots, x_k; n, p_1, \ldots, p_k) = \Pr(X_1 = x_1 \text{ and } \ldots \text{ and } X_k = x_k) = \begin{cases} \dfrac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, & \text{when } \sum_{i=1}^{k} x_i = n \\ 0, & \text{otherwise,} \end{cases}$$

for non-negative integers x_1, ..., x_k,

where x_i is the number of times class i is observed, p_i is the probability of class i
(proportion), and n is the number of trials (sample size).


• Ordered Logistic Regression: The response variable is a
categorical variable with some built-in order. This
method is the same as multinomial, with the key difference of having
an inherent order in the levels. For example, a rating variable between
1 and 5.
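
The two probability mass functions above can be checked numerically against the density
functions that ship with base R, dbinom() and dmultinom(). The numbers below are
arbitrary values chosen for illustration only.

```r
# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.4
n <- 10; k <- 3; p <- 0.4
binom_formula <- choose(n, k) * p^k * (1 - p)^(n - k)
binom_builtin <- dbinom(k, size = n, prob = p)

# Multinomial: counts x over three classes with class probabilities pk
x  <- c(2, 3, 5)
pk <- c(0.2, 0.3, 0.5)
multi_formula <- factorial(sum(x)) / prod(factorial(x)) * prod(pk^x)
multi_builtin <- dmultinom(x, prob = pk)

print(c(binom_formula, binom_builtin))
print(c(multi_formula, multi_builtin))
```

In both cases the hand-written formula and the built-in function return the same
probability.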


Let’s look at two important terms we will use in explaining the logistic regression. 


6.5.7 Logit Transformation 


For logistic regression, we use a transformation function, also called the link function,
which relates the mean of the binomially distributed response to a linear function of the
independent variables. The link function used for the binomial distribution is called the
logit function; its inverse is the logistic function.

The logistic function σ(t) is defined as follows:

$$\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}$$

The logistic function curve looks like this:


curve((1/(1+exp(-x))),-10,10,col="violet")

Figure 6-25. Logit function


You can observe that the function is capped from the top by 1 and from the bottom
by 0. Extremely high values of x have very little effect on the function value, and the
same holds for very small values. The output is therefore always bounded between 0 and 1,
the probability scale on which we fit a model.
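
As a small numerical illustration of the curve, the sketch below evaluates the function
at a few points and compares it with base R's plogis() and qlogis(), which implement the
logistic function and its inverse (the logit). The helper name sigma and the chosen
input values are assumptions for illustration.

```r
# Logistic function written out, as in the formula above
sigma <- function(t) 1 / (1 + exp(-t))

t_values <- c(-10, -2, 0, 2, 10)
p <- sigma(t_values)
print(round(p, 5))

# Base R's plogis() is the same function; qlogis() is its inverse (the logit)
print(all.equal(p, plogis(t_values)))
print(all.equal(qlogis(p), log(p / (1 - p))))
```

The values stay strictly between 0 and 1, and applying the logit to them recovers the
original inputs.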

In binomial logistic regression, we estimate the coefficients by maximum likelihood
estimation (MLE), while for multinomial logistic regression we use an iterative method to
optimize the logLoss function.

The logit function then converts the relationship into the log of the odds as a linear
combination of the independent variables. The inverse of the logistic function, g, the
logit (log odds), maps the relationship into a linear one:

$$g(F(x)) = \ln\left(\frac{F(x)}{1 - F(x)}\right) = \beta_0 + \beta_1 x$$

In this section we discuss logistic regression with binomial categorical variables,
and in a later part we will touch at a high level on how to extend this method to a
multinomial class.


6.5.8 Odds Ratio 


In Chapter 1, we discussed the probability measure, which signifies the chance of an
event occurring. The value of a probability is always between 0 and 1, where 0 means the
event definitely does not occur and 1 means the event definitely occurs.

We define probability odds, or simply odds, as the ratio of the chance of the event
happening to the chance of it not happening:


Odds in favor of event A = P(A)/(1-P(A))
Odds against event A = (1-P(A))/P(A) = 1/Odds in favor


So an odds of 2 for event A means that event A is 2 times more likely to happen than
not to happen. The ratio can be generalized to any number of classes,
where the interpretation changes to the likelihood of an event happening against all
other possible events.

The odds ratio is a way to represent the relationship between the presence/absence of an
event “A” and the presence/absence of an event “B”. For a binary case, we can create an
example as shown:


Odds ratio = (Odds of A) / (Odds of B)


For example, let’s assume there are two types of event outcome, A and B. The probability
of event A happening, P(A), is 0.4 and that of event B, P(B), is 0.6. Then the odds in
favor of A are 0.67 (P(A)/(1-P(A))); similarly, the odds for B are 1.5 (P(B)/(1-P(B))),
i.e., the chance of event B happening is 1.5 times that of it not happening.

Now the odds ratio is defined as the ratio of these odds: odds of B by odds of A =
1.5/0.67 = 2.25. This says that the odds of B happening are a little more than twice the
odds of event A happening. We can observe that this quantity is a relative measure, and
hence we use the concept of base levels in logistic regression. The odds ratio from the
model is relative to the base level/class.
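
The arithmetic of the worked example can be reproduced in a few lines. The helper
function odds() and the variable names are illustrative only.

```r
# Probabilities from the worked example above
p_A <- 0.4
p_B <- 0.6

# Odds in favor of an event: p / (1 - p)
odds <- function(p) p / (1 - p)

odds_A <- odds(p_A)   # 0.4 / 0.6
odds_B <- odds(p_B)   # 0.6 / 0.4

# Odds ratio of B relative to A
odds_ratio <- odds_B / odds_A

print(round(c(odds_A = odds_A, odds_B = odds_B, odds_ratio = odds_ratio), 3))
```

Swapping the roles of A and B gives the reciprocal of this ratio, which is why the choice
of base level matters when reading odds ratios from a model.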

Now, we can introduce the relationship between the logit and odds ratios. Logistic
regression essentially models the following equation, which applies the logit transform to
the odds of the event and converts our problem to its linear form as shown:

$$\text{logit}\left(E[Y_i \mid x_{1,i}, \ldots, x_{m,i}]\right) = \text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$$

Hence in logistic regression, the odds ratio is the exponentiated coefficient of a
variable, signifying the relative chance of the event compared with the reference
class/event. Here, you can see how the odds ratio translates to the exponentiated
coefficients of logistic regression:

$$OR = \frac{\text{odds}(x+1)}{\text{odds}(x)} = \frac{\dfrac{F(x+1)}{1 - F(x+1)}}{\dfrac{F(x)}{1 - F(x)}} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$
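
The identity between the odds ratio and the exponentiated coefficient can be verified on
a fitted model. The sketch below uses the built-in mtcars dataset and a single predictor;
the dataset, the formula, and the evaluation points are assumptions for illustration and
are unrelated to the purchase data used in the next section.

```r
# Illustrative only: model the binary 'am' column of mtcars from car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
beta_1 <- coef(fit)[["wt"]]

# Predicted probabilities at wt = 3 and wt = 4 (one unit apart)
p <- predict(fit, newdata = data.frame(wt = c(3, 4)), type = "response")

# Odds at each point and their ratio
odds <- p / (1 - p)
odds_ratio <- unname(odds[2] / odds[1])

print(odds_ratio)
print(exp(beta_1))
```

The ratio of the odds one unit apart equals the exponentiated coefficient, whichever pair
of points one unit apart is chosen.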
6.5.8.1 Binomial Logistic Model


Let’s use our Purchase Prediction data to build a logistic regression model and see its
diagnostics. We will subset the data to keep only ProductChoice 1 and 3, coded as 1 and 0
respectively, in our analysis.


#Load the data and prepare a dataset for logistic regression
Data_Purchase_Prediction <- read.csv("~/Dropbox/Book Writing - Drafts/Chapter Drafts/Final Artwork and Code/Chapter 6/Dataset/Purchase Prediction Dataset.csv", header=TRUE);

Data_Purchase_Prediction$choice <- ifelse(Data_Purchase_Prediction$ProductChoice ==1,1,
ifelse(Data_Purchase_Prediction$ProductChoice ==3,0,999));

Data_Logistic <- Data_Purchase_Prediction[Data_Purchase_Prediction$choice %in% c("0","1"),
c("CUSTOMER_ID","choice","MembershipPoints","IncomeClass","CustomerPropensity","LastPurchaseDuration")]

table(Data_Logistic$choice, useNA="always")

     0      1   <NA>
143893 106603      0

Data_Logistic$MembershipPoints <- factor(Data_Logistic$MembershipPoints)
Data_Logistic$IncomeClass <- factor(Data_Logistic$IncomeClass)
Data_Logistic$CustomerPropensity <- factor(Data_Logistic$CustomerPropensity)
Data_Logistic$LastPurchaseDuration <- as.numeric(Data_Logistic$LastPurchaseDuration)


Before we start the model, let's see the distribution of the categorical variables over
the dependent categorical variable.


table(Data_Logistic$MembershipPoints,Data_Logistic$choice)

         0     1
  1  15516 13649
  2  19486 15424
  3  20919 15661
  4  20198 14944
  5  18868 13728
  6  16710 11883
  7  13635  9381
  8   9632  6432
  9   5566  3512
  10  2427  1446
  11   754   441
  12   165    95
  13    17     7

This distribution shows that the customer counts for both choice 0 and choice 1 peak at
low MembershipPoints (around 3) and fall steadily as MembershipPoints increase.


table(Data_Logistic$IncomeClass,Data_Logistic$choice) 


0 1 
1 145 156 
2 203 209 
3 3996 3535 
4 23894 18952 
5 47025 36781 
6 38905 27804 
7 21784 14715 
8 6922 3958 
9 1019 493 


This distribution says that most of the customers are in income classes 4, 5, and 6. The
split between 0 and 1 is fairly balanced across the income class bands, with the share of
1 shrinking in the highest classes.


table(Data_Logistic$CustomerPropensity,Data_Logistic$choice)

               0     1
  High     26604 10047
  Low      20291 19962
  Medium   27659 17185
  Unknown  36633 52926
  VeryHigh 32706  6483


The distribution is interesting, as it tells us that customers with very high propensity
are very unlikely to buy the product represented by class 1. These distributions are a
good way to get a first-hand idea of your data. This exploratory task also helps in
feature selection for models.

Now that we have all the relevant libraries and functions loaded, we will show step by
step how to develop the logistic regression model and choose one of the models for
evaluation in the next chapter. We will develop the model on the full data, and the next
chapter will discuss performance evaluation metrics in detail.


library(dplyr)
#Get the average purchase rate by IncomeClass and plot that
summary_Rating <- summarise(group_by(Data_Logistic, IncomeClass), Average_Rate = mean(choice))

print("Summary of Average Purchase Rate by IncomeClass")
[1] "Summary of Average Purchase Rate by IncomeClass"
summary_Rating
# A tibble: 9 x 2
  IncomeClass Average_Rate
  <fctr> <dbl>
1 1 0.5182724
2 2 0.5072816
3 3 0.4693932
4 4 0.4423283
5 5 0.4388827
6 6 0.4167953
7 7 0.4031617
8 8 0.3637868
9 9 0.3260582


plot(summary_Rating$IncomeClass, summary_Rating$Average_Rate, type="b",
xlab="Income Class", ylab="Average Purchase Rate observed",
main="Purchase Rate and Income Class")


Now we want to see how the average purchase rate of product 1 varies over the income
class. We plot the average purchase rate (proportion of 1) for each income class, as shown
in Figure 6-26.


Figure 6-26. Purchase rate and income class (x-axis: Income Class; y-axis: Average
Purchase Rate observed)


The plot in Figure 6-26 shows that, as the income class increases, the propensity to
buy product 1 goes down. Similar plots can be created for other variables to see how
the model probabilities are expected to behave after fitting a model.

Now we will remove the NAs (missing values) from the data and fit a binary logistic
regression using the function glm(). GLM stands for generalized linear model, which
can handle the exponential family of distributions. The function requires users to specify
the family of distributions the dependent variable belongs to and the link function to
use. We have used the binomial family with logit as the link function.


#Remove the Missing values - NAs
Data_Logistic <- na.omit(Data_Logistic)
rownames(Data_Logistic) <- NULL

#Divide the data into Train and Test
set.seed(917);
index <- sample(1:nrow(Data_Logistic), round(0.7*nrow(Data_Logistic)))
train <- Data_Logistic[index, ]
test <- Data_Logistic[-index, ]


Fitting a logistic model 


Model_logistic <- glm(choice ~ MembershipPoints + IncomeClass +
CustomerPropensity + LastPurchaseDuration, data = train,
family = binomial(link = 'logit'));
summary(Model_logistic)
Call:
glm(formula = choice ~ MembershipPoints + IncomeClass + CustomerPropensity +
    LastPurchaseDuration, family = binomial(link = "logit"),
    data = train)


Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.631  -1.017  -0.614   1.069   2.223
Coefficients:
Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.066989 0.145543 0.460 0.645323 
MembershipPoints2 -0.123408 0.020577 -5.997 2.01e-09 *** 
MembershipPoints3 -0.185540 0.020359 -9.113 < 2e-16 *** 
MembershipPoints4 -0.204938 0.020542 -9.977 < 2e-16 *** 
MembershipPoints5 -0.237311 0.020942 -11.332 < 2e-16 *** 
MembershipPoints6 -0.258884 0.021597 -11.987 < 2e-16 *** 
MembershipPoints7 -0.291123 0.022894 -12.716 < 2e-16 *** 
MembershipPoints8 -0.326029 0.025526 -12.773 < 2e-16 *** 
MembershipPoints9 -0.387113 0.031572 -12.261 < 2e-16 *** 
MembershipPoints10 -0.439228 0.044839 -9.796 < 2e-16 *** 
MembershipPoints11 -0.357339 0.078493 -4.553 5.30e-06 *** 
MembershipPoints12 -0.447326 0.164172 -2.725 0.006435 ** 
MembershipPoints13 -1.349163 0.583320 -2.313 0.020728 * 
IncomeClass2 -0.412020 0.190461 -2.163 0.030520 * 
IncomeClass3 -0.342854 0.146938 -2.333 0.019631 * 
IncomeClass4 -0.389236 0.144433 -2.695 0.007040 **
IncomeClass5 -0.373493 0.144169 -2.591 0.009579 ** 
IncomeClass6 -0.442134 0.144244 -3.065 0.002175 ** 
IncomeClass7 -0.455158 0.144548 -3.149 0.001639 ** 
IncomeClass8 -0.509290 0.146126 -3.485 0.000492 *** 
IncomeClass9 -0.569825 0.160174 -3.558 0.000374 *** 
CustomerPropensityLow 0.877850 0.018709 46.921 < 2e-16 *** 
CustomerPropensityMedium 0.427725 0.018491 23.131 < 2e-16 *** 
CustomerPropensityUnknown 1.208693 0.016616 72.744 < 2e-16 *** 
CustomerPropensityVeryHigh -0.601513 0.021652 -27.781 < 2e-16 *** 
LastPurchaseDuration -0.063463 0.001211 -52.393 < 2e-16 *** 
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


(Dispersion parameter for binomial family taken to be 1) 


Null deviance: 235658 on 172985 degrees of freedom 
Residual deviance: 213864 on 172960 degrees of freedom 
AIC: 213916 


Number of Fisher Scoring iterations: 4 


The p-values of all the variables and levels are significant at the 5% level (only the
intercept is not). This implies we have fit a model whose variables have a significant
relationship with the dependent variable. Now let's work out the classification matrix for
this model. This is done by balancing the specificity and sensitivity measures. Details of
these metrics are given in the next chapter; we give a brief explanation here and use it
to choose a good cutoff for converting probabilities into classes.


#install and load package
library(pROC)

#apply roc function
cut_off <- roc(response=train$choice, predictor=Model_logistic$fitted.values)

#Find threshold that maximizes sensitivity + specificity
e <- cbind(cut_off$thresholds, cut_off$sensitivities+cut_off$specificities)
best_t <- subset(e, e[,2]==max(e[,2]))[,1]

#Plot ROC Curve
plot(1-cut_off$specificities, cut_off$sensitivities, type="l",
ylab="Sensitivity", xlab="1-Specificity", col="green", lwd=2,
main="ROC Curve for Train")
abline(a=0, b=1)
abline(v = best_t) #add optimal t to ROC curve

The plot in Figure 6-27 shows sensitivity against (1-specificity). This plot is also called
the ROC plot. The best cutoff is the value on the curve that maximizes sensitivity and
minimizes (1-specificity).

cat(" The best value of cut-off for classifier is ", best_t)
The best value of cut-off for classifier is 0.4202652
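The cutoff search above can also be written without pROC. The following is a minimal base-R sketch on a small made-up dataset (the vectors `actual` and `prob` are hypothetical, not taken from the purchase data); it scans every candidate threshold and keeps the one with the largest sensitivity + specificity.

```r
# Hypothetical toy data: true classes and predicted probabilities
actual <- c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
prob   <- c(0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.65, 0.70, 0.80, 0.90)

# Every distinct predicted probability is a candidate cutoff
thresholds <- sort(unique(prob))

# Sensitivity + specificity at each candidate cutoff
score <- sapply(thresholds, function(t) {
  pred <- as.integer(prob >= t)
  sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)
  spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)
  sens + spec
})

# Cutoff with the best balance of the two measures
best_cut <- thresholds[which.max(score)]
print(best_cut)
```

On real data the candidate thresholds would come from the fitted probabilities of the training set, as in the pROC-based code.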


Figure 6-27. ROC curve for train data (x-axis: 1-Specificity; y-axis: Sensitivity)


Looking at the plot, we can see our choice of cutoff will provide best classification 
on the train data. We need to test this assumption on the test data and record the 
classification rate by using this cutoff of 0.42. 


# Predict the probabilities for test and apply the cut-off
predict_prob <- predict(Model_logistic, newdata=test, type="response")

#Apply the cutoff to get the class
#(0.41 is used here, slightly below the 0.42 cutoff found on the train data)
class_pred <- ifelse(predict_prob > 0.41, 1, 0)

#Classification table
table(test$choice, class_pred)
   class_pred
        0     1
  0 24605 18034
  1  8364 23134
#Classification rate
sum(diag(table(test$choice, class_pred))/nrow(test))
[1] 0.6439295


The model classifies 64% of the test cases correctly. This shows the model captures
some of the signal in the data that distinguishes between 0 and 1.

Model diagnostics for logistic regression differ from those for linear regression models.
In the following sections, we explore some common diagnostic metrics for logistic regression.


6.5.9 Model Diagnostics: Logistic Regression 
Once we have fit the model, we have a two-step analysis to do on the logistic output:


1. If we are interested in the final assignment of class, we focus on the
classifier and compare the actual classes with the classes assigned by
applying the classifier to the predicted probabilities.


2. If we are interested in the probabilities, we check whether the
cases where the chances of the event are high are getting high
probabilities.


Other than this, we want to look at the coefficients, R-Square equivalents, and other
tests to verify that our model has been fit with statistical validity. Another important
thing to keep in mind while looking at the coefficients is that a logistic regression
coefficient represents the change in the logit (log odds) for each unit change in the
predictor, which is not the same as in linear regression.
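Because the coefficients live on the logit scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below is only an illustration: it copies two estimates from the summary output printed earlier into a hypothetical vector `coefs` rather than reading them from the fitted model object.

```r
# Two coefficient estimates copied from the printed model summary
coefs <- c(LastPurchaseDuration       = -0.063463,
           CustomerPropensityVeryHigh = -0.601513)

# exp(coefficient) is the multiplicative change in the odds of choice = 1
# for a one-unit increase in the predictor (or versus the reference level)
odds_ratio <- exp(coefs)
print(round(odds_ratio, 3))
```

An odds ratio below 1 means the odds of choice = 1 go down: each extra unit of LastPurchaseDuration multiplies the odds by about 0.94, and the VeryHigh propensity level has roughly 0.55 times the odds of the reference level. With the fitted model at hand, `exp(coef(Model_logistic))` would give the same quantities for all terms.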

We will show how to perform three diagnostic tests (the Wald test, the likelihood ratio
test, and deviance/pseudo R-Square) and three measures of separation (bivariate plots,
gains/lift charts, and the concordance ratio).


6.5.9.1 Wald Test 


The Wald test is analogous to the t-test in linear regression. It is used to assess the
contribution of individual predictors in a given model.
In logistic regression, the Wald statistic is

$$ W_j = \frac{\beta_j^2}{SE_{\beta_j}^2} $$

where $\beta_j$ is the coefficient and $SE_{\beta_j}$ is the standard error of coefficient $\beta_j$.

The Wald statistic is the ratio of the square of the regression coefficient to the square
of the standard error of the coefficient, and under the null hypothesis it follows a
chi-square distribution. A significant Wald statistic implies that the predictor/independent
variable is significant in the model.

Let's perform a Wald test on MembershipPoints and see whether it is significant in the model
or not.


#Wald test
library(survey)
regTermTest(Model_logistic, "MembershipPoints", method="Wald")
Wald test for MembershipPoints
 in glm(formula = choice ~ MembershipPoints + IncomeClass +
    CustomerPropensity +
    LastPurchaseDuration, family = binomial(link = "logit"),
    data = train)
F = 31.64653 on 12 and 172960 df: p= < 2.22e-16


The p-value is less than 0.05, so at 95% confidence we can reject the null hypothesis
that the MembershipPoints coefficients are zero. Hence, MembershipPoints is a statistically
significant variable in the model.
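For a single coefficient, the Wald statistic can be worked out by hand from the summary table. The sketch below assumes the rounded estimate and standard error of LastPurchaseDuration as printed in the model summary, so the result only approximately reproduces the reported z value.

```r
# Estimate and standard error of LastPurchaseDuration, as printed in summary()
beta <- -0.063463
se   <- 0.001211

# Wald statistic: squared coefficient over squared standard error
W <- beta^2 / se^2

# For one coefficient, W is compared against a chi-square with 1 degree of freedom
p_value <- pchisq(W, df = 1, lower.tail = FALSE)

# sqrt(W) is the absolute z value shown in the coefficient table
print(c(W = W, abs_z = sqrt(W), p_value = p_value))
```

The square root of W is about 52.4, in line with the z value of -52.393 in the summary (the small gap comes from rounding of the printed inputs).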


6.5.9.2 Deviance 


Deviance is calculated by comparing a fitted model with a saturated model. A saturated
model has one parameter per observation and therefore fits the data perfectly, while a null
model is a model without any predictor in it, just the intercept term. The null deviance
and residual deviance reported by glm() are the deviances of the null model and of the
fitted model, respectively. In logistic regression, deviance is used in lieu of sum of
squares calculations. The test statistic (often denoted by D) is minus twice the log of the
likelihood ratio, i.e., it is twice the difference in the log likelihoods:

$$ D = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}} $$

The deviance statistic (D) approximately follows a chi-square distribution. Smaller values
indicate a better fit, as the fitted model deviates less from the saturated model.
Here is the analysis of deviance table.

#Anova table of significance
anova(Model_logistic, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit

Response: choice

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 172985 235658
MembershipPoints 12 330.6 172973 235328 < 2.2e-16 ***
IncomeClass 8 339.1 172965 234989 < 2.2e-16 ***
CustomerPropensity 4 18297.4 172961 216691 < 2.2e-16 ***
LastPurchaseDuration 1 2826.9 172960 213864 < 2.2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The chi-square test on all the variables is significant as the p-value is less than 0.05. 
All the predictors’ contributions to the model are significant. 


6.5.9.3 Pseudo R-Square 


In linear regression, we have the R-Square measure (discussed in detail in Chapter 7), which
measures the proportion of variance in the dependent variable explained by the model. A
similar measure in logistic regression is called pseudo R-Square. The most popular such
measure uses the likelihood ratio and is given by:

$$ R^2_L = \frac{D_{null} - D_{fitted}}{D_{null}} $$

This is the ratio of the difference in deviance between the null and fitted models to the
deviance of the null model. The higher the value of this measure, the better the explanatory
power of the model. There are other similar measures, like Cox and Snell R-Square,
Nagelkerke R-Square, McFadden R-Square, and Tjur R-Square. Here we compute the pseudo
R-Square for our model.


# R square equivalent for logistic regression
library(pscl)
pR2(Model_logistic)
          llh       llhNull            G2      McFadden          r2ML          r2CU
-1.069321e+05 -1.178291e+05  2.179399e+04  9.248135e-02  1.183737e-01  1.591199e-01


The last three outputs from this function are McFadden's pseudo R-Square, the
maximum likelihood pseudo R-Square (Cox & Snell), and Cragg and Uhler's or
Nagelkerke's pseudo R-Square. The R-Square values are very low, signifying that the
model improves only modestly on a null model.
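The deviance-based pseudo R-Square and the overall likelihood ratio test can be reproduced by hand from the numbers already printed in the model summary. This is a small sketch that assumes the rounded null and residual deviances (and their degrees of freedom) reported by summary(); the variable names are illustrative.

```r
# Deviances and degrees of freedom as printed in the model summary
null_deviance     <- 235658
residual_deviance <- 213864
df_null           <- 172985
df_residual       <- 172960

# Pseudo R-Square based on the deviance ratio
pseudo_r2 <- (null_deviance - residual_deviance) / null_deviance

# Likelihood ratio statistic and its chi-square p-value
lr_stat <- null_deviance - residual_deviance
lr_df   <- df_null - df_residual
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(pseudo_r2 = pseudo_r2, lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p))
```

The ratio comes out at about 0.092, which agrees with the McFadden value, and the drop in deviance of 21794 agrees with G2 in the pR2() output. The likelihood ratio test is strongly significant even though the pseudo R-Square is small, which is common with very large samples.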
6.5.9.4 Bivariate Plots


The most important diagnostic of logistic regression is to see how the actual probabilities
and predicted probabilities behave at each level of a single independent variable. These
plots are called bivariate because two variables, actual and predicted, are plotted against
the levels of a single independent variable. The plot has three important inputs:


• Actual Probability: The prior proportion of the target level in each
category of the independent variable.


• Predicted Probability: The probability given by the model.


• Frequency: The frequency of each level of the categorical variable (number of
observations).


The plot essentially tells us how the model is behaving for different levels in our 
categorical variables. You can extend this idea to continuous variables as well by binning 
the continuous variable. 

Another good thing about these plots is that you are able to determine for which
cohort in your dataset the model performs better and where it needs investigation. This
cohort-level diagnostic is not possible by looking at aggregated plots.
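The three inputs of a bivariate plot are simple group-wise summaries. The following base-R sketch builds them for a tiny made-up data frame (`toy`, with hypothetical columns `level`, `choice`, and `pred`); it is not the appendix function, only an illustration of what that kind of plot is built from.

```r
# Hypothetical toy data: a categorical predictor, the 0/1 response,
# and the model's predicted probability for each row
toy <- data.frame(
  level  = c("A", "A", "A", "B", "B", "B"),
  choice = c(1, 0, 1, 0, 0, 1),
  pred   = c(0.7, 0.5, 0.6, 0.3, 0.2, 0.4)
)

# Actual rate, mean predicted probability, and frequency per level
actual    <- tapply(toy$choice, toy$level, mean)
predicted <- tapply(toy$pred, toy$level, mean)
frequency <- table(toy$level)

bivariate <- data.frame(
  level     = names(actual),
  actual    = as.vector(actual),
  predicted = as.vector(predicted),
  frequency = as.vector(frequency)
)
print(bivariate)
```

Plotting `actual` and `predicted` as two lines over `level`, with `frequency` as bars, gives the bivariate view described above. A continuous predictor could be binned with `cut()` first.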


#The function code is provided separately in the appendix
source("actual_pred_plot.R")
MODEL_PREDICTION <- predict(Model_logistic, Data_Logistic, type='response');
Data_Logistic$MODEL_PREDICTION <- MODEL_PREDICTION
#Print the plots MembershipPoints
actual_pred_plot(var.by=as.character("MembershipPoints"),
var.response='choice',
data=Data_Logistic,
var.predict.current='MODEL_PREDICTION',
var.predict.reference=NULL,
var.split=NULL,
var.by.buckets=NULL,
sort.factor=FALSE,
errorbars=FALSE,
subset.to=FALSE,
barline.ratio=1,
title="Actual vs. Predicted Purchase Rates",
make.plot=TRUE
)


The plot in Figure 6-28 shows actual versus predicted against the frequency plot of
MembershipPoints.

Figure 6-28. Actual versus predicted plot against MembershipPoints


For MembershipPoints, the actual and predicted probabilities follow each other. This
means the model predicts probabilities close to the actual ones. You can also see that
customers with higher MembershipPoints are less likely to have product choice = 1, and
this pattern appears in both the actual and the predicted rates.


#Print the plots IncomeClass
actual_pred_plot(var.by=as.character("IncomeClass"),
var.response='choice',
data=Data_Logistic,
var.predict.current='MODEL_PREDICTION',
var.predict.reference=NULL,
var.split=NULL,
var.by.buckets=NULL,
sort.factor=FALSE,
errorbars=FALSE,
subset.to=FALSE,
barline.ratio=1,
title="Actual vs. Predicted Purchase Rates",
make.plot=TRUE
)

The plot in Figure 6-29 shows actual versus predicted against the frequency plot of
IncomeClass.

Figure 6-29. Actual versus predicted plot against IncomeClass


Again, the model behavior for IncomeClass is as expected. The predicted probabilities
track the actual observed rates across the different income classes.


#Print the plots CustomerPropensity
actual_pred_plot(var.by=as.character("CustomerPropensity"),
var.response='choice',
data=Data_Logistic,
var.predict.current='MODEL_PREDICTION',
var.predict.reference=NULL,
var.split=NULL,
var.by.buckets=NULL,
sort.factor=FALSE,
errorbars=FALSE,
subset.to=FALSE,
barline.ratio=1,
title="Actual vs. Predicted Purchase Rates",
make.plot=TRUE
)

The plot in Figure 6-30 shows actual versus predicted against the frequency plot of
CustomerPropensity.

Figure 6-30. Actual versus predicted plot against CustomerPropensity


The model shows good agreement with observed probabilities for
CustomerPropensity as well. Similar plots can be drawn for the continuous variables in our
model after binning them appropriately.

Across the levels of all three categorical variables, the predictions stay close to the
actual values. It's a good model at the probability scale!


6.5.9.5 Cumulative Gains and Lift Charts 


Cumulative gains and lift charts are visual ways to measure the effectiveness of predictive 
models. They consist of a baseline and the lift curve due to the predictive model. The more 
there is separation between baseline and predicted (lift) curve, the better the model. 

In a Gains curve:

X-axis: % of customers

Y-axis: % of positive responses

Baseline: Random line (x% of customers giving x% of positive responses)

Gains: The percentage of positive responses for the % of customers

In a Lift curve:

X-axis: % of customers

Y-axis: Actual lift (the ratio between the result predicted by our model and the result
using no model)


library(gains)
library(ROCR)
library(calibrate)

MODEL_PREDICTION <- predict(Model_logistic, Data_Logistic, type='response');
Data_Logistic$MODEL_PREDICTION <- MODEL_PREDICTION;

lift = with(Data_Logistic, gains(actual = Data_Logistic$choice, predicted =
Data_Logistic$MODEL_PREDICTION, optimal = TRUE));

pred = prediction(MODEL_PREDICTION, as.numeric(Data_Logistic$choice));

# Function to create performance objects. All kinds of predictor evaluations
# are performed using this function.
gains = performance(pred, 'tpr', 'rpp');
# tpr: True positive rate
# rpp: Rate of positive predictions
auc = performance(pred, 'auc');
auc = unlist(slot(auc, 'y.values')); # The same as: auc@y.values[[1]]
auct = paste(c('AUC = '), round(auc, 2), sep='')

#par(mfrow=c(1,2), mar=c(6,5,4,2));

plot(gains, col='red', lwd=2, xaxs='i', yaxs='i', main=paste('Gains Chart ', sep=''),
ylab='% of Positive Response', xlab='% of customers/population');
axis(side=1, pos=0, at=seq(0, 1, by=0.10));
axis(side=2, pos=0, at=seq(0, 1, by=0.10));

# Baseline: random model
lines(x=c(0,1), y=c(0,1), type='l', col='black', lwd=2,
ylab='% of Positive Response', xlab='% of customers/population');

legend(0.6, 0.4, auct, cex=1.1, box.col='white')

gains = lift$cume.pct.of.total
deciles = length(gains);

for (j in 1:deciles)
{
x = 0.1;
y = as.numeric(as.character(gains[[j]]));
lines(x = c(x*j, x*j),
y = c(0, y),
type = 'l', col = 'blue', lwd = 1);
lines(x = c(0, 0.1*j),
y = c(y, y),
type = 'l', col = 'blue', lwd = 1);
# Annotating the chart by adding the True Positive Rate exact numbers at the
# specified deciles.
textxy(0, y, paste(round(y,2)*100, '%', sep=''), cex=0.9);
}


The chart in Figure 6-31 is the Gains chart for our model. This is plotted with % of 
positive responses on the y axis and % of population on the x axis. 


Figure 6-31. Gains charts with AUC (x-axis: % of customers/population; y-axis: % of
Positive Response)


plot(lift,
xlab = '% of customers/population',
ylab = 'Actual Lift',
main = paste('Lift Chart \n', sep = ' '),
xaxt = 'n');
axis(side = 1, at = seq(0, 100, by = 10), las = 1, hadj = 0.4);


The chart in Figure 6-32 is the Lift chart for our model. This is plotted with actual lift 
on the y axis and % of population on the x axis. 
Figure 6-32. Lift chart


The Gains chart shows a good separation between the model line and the baseline. This
shows that the model has good separation power. Also, the AUC value of 0.7 means that in
roughly 70% of the (0,1) pairs, the model assigns the higher probability to the actual 1.

The Lift curve shows a lift of close to 70% for the first 10% of the population. This
value needs to be the same as what we observed in the Gains chart; only the presentation
has changed.
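The relation between the two charts is easier to see when the numbers are computed directly. Here is a minimal base-R sketch on ten made-up cases (the vectors `actual` and `prob` are hypothetical); it sorts by predicted probability, accumulates the positives, and divides by the share of population to get the lift.

```r
# Hypothetical toy data: actual 0/1 outcomes and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 1, 0, 0, 0, 0)
prob   <- c(0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05)

# Rank cases from highest to lowest predicted probability
ord <- order(prob, decreasing = TRUE)

# Cumulative gains: share of all positives captured so far
cum_gain <- cumsum(actual[ord]) / sum(actual)

# Share of the population contacted so far
pct_pop <- seq_along(actual) / length(actual)

# Lift: gains relative to the random baseline (where gain equals pct_pop)
lift <- cum_gain / pct_pop

print(data.frame(pct_pop, cum_gain, lift))
```

In this toy example the top 20% of cases capture 50% of the positives, a lift of 2.5 over random selection. Both curves always end at the same point: 100% of positives captured and a lift of 1.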


6.5.9.6 Concordance and Discordant Ratios 


In any classification model based on raw probabilities, we need a classification
methodology to separate these probability cases. In binary logistic regression, this is
most often done by choosing a cutoff value and then creating an inequality to classify
objects into 0 or 1.

To make sure such a cutoff exists and has good separation power, we have to see whether
the objects with actual state 1 in the data have a higher probability than those with
actual state 0. For example, if a pair (Yi,Yj) is (0,1), then the predicted values (Pi,Pj)
should have Pj>Pi; we can then choose a number between Pi and Pj that will correctly
classify the 1 as 1 and the 0 as 0. Based on this understanding, we can divide all the
possible (0,1) pairs in the data into three types:


• Concordant pairs: For (0,1) or (1,0), the probability corresponding
to 1 is greater than the probability corresponding to 0


• Discordant pairs: For (0,1) or (1,0), the probability corresponding
to 0 is greater than the probability corresponding to 1


• Tied pairs: For (0,1) or (1,0), both cases have the same predicted probability
The concordance ratio is then defined as the ratio of the number of concordant pairs
to the total number of pairs.

If our model produces a high concordance ratio, it will be able to classify the
objects into classes more accurately.


#The function code is provided separately in R-code for chapter 6
source("concordance.R")

#Call the concordance function to get these ratios
concordance(Model_logistic)
$Concordance
[1] 0.7002884

$Discordance
[1] 0.2991384

$Tied
[1] 0.0005731122

$Pairs
[1] 7302228665


The concordance is 70.0%, signifying that the model probabilities have good
separation on ~70% of the pairs. This is a good model to create a classifier for 0 and 1.
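The pair counting behind these ratios can be sketched in a few lines of base R. This is not the concordance.R function shipped with the chapter; it is a small illustration on hypothetical vectors `actual` and `prob`, using outer() to compare every actual 1 with every actual 0.

```r
# Hypothetical toy data: actual classes and predicted probabilities
actual <- c(0, 0, 0, 1, 1)
prob   <- c(0.2, 0.6, 0.5, 0.5, 0.8)

p1 <- prob[actual == 1]   # probabilities of the actual 1s
p0 <- prob[actual == 0]   # probabilities of the actual 0s

# One entry per (1, 0) pair: positive if the 1 got the higher probability
diffs <- outer(p1, p0, "-")
pairs <- length(diffs)

concordance <- sum(diffs > 0)  / pairs
discordance <- sum(diffs < 0)  / pairs
tied        <- sum(diffs == 0) / pairs

print(c(Concordance = concordance, Discordance = discordance,
        Tied = tied, Pairs = pairs))
```

Here 4 of the 6 pairs are concordant, 1 is discordant, and 1 is tied. The outer() approach materializes every pair, so for data of the size used in this chapter (billions of pairs) a rank-based or chunked computation would be needed instead.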

In this section on model diagnostics for logistic regression, we discussed the
diagnostics in broadly two buckets: model fit statistics and model classification power.
The model fit statistics covered the Wald test, which is a significance test for the
parameter estimates; deviance, which is a measure similar to the residual in linear
regression; and pseudo R-Square, which is the counterpart of R-Square in linear regression.
The other set of diagnostics was to identify whether the model can be used to create a
powerful classifier. The tests included bivariate plots, which plot actual probabilities
against predicted probabilities; cumulative gains and lift charts, which show how well our
model differentiates between the two classes; and the concordance ratio, which tells us
whether we can have a good cutoff value for our classifier. These diagnostics reveal vital
properties of the model and help the modeler to either improve or re-estimate the models.

In the next section, we will move on to multi-class classification problems. Multi-class
problems are among the hardest problems to solve, as a larger number of classes brings
ambiguity. It is difficult to create a good classifier in many cases. We will discuss
multiple ways you can do multi-class classification using machine learning. Multinomial
logistic regression is one of the popular ways to do multi-class classification.


6.5.10 Multinomial Logistic Regression 


Multinomial logistic regression is used when we have more than two categories for
classification. The dependent variable in that case follows a multinomial distribution. In
the background, we create a logistic model for each class and then combine those into one
single equation by imposing the constraint that the sum of all probabilities be 1.
The equation setup for the multinomial logistic model is shown here:

$$ \Pr(Y_i = K-1) = \frac{e^{\beta_{K-1} \cdot X_i}}{1 + \sum_{k=1}^{K-1} e^{\beta_k \cdot X_i}} $$

The estimation process has an additional constraint on the individual logit
transformations: the sum of probabilities from all logit functions needs to be 1. As the
estimation has to take care of this constraint, the estimation method is an iterative one.
The best coefficients for the model are found by iterative optimization of the logLoss function.
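To see how the constraint works in practice, the sketch below turns a set of linear predictors into class probabilities using the formula above. The values in `eta` are made up for illustration (they stand for the products of coefficients and predictors for the K-1 non-reference classes); the reference class takes the remaining probability.

```r
# Hypothetical linear predictors (beta_k . X_i) for the non-reference classes
eta <- c(class2 = 0.5, class3 = -0.2, class4 = 1.0)

# Common denominator: 1 (reference class) plus the exponentiated predictors
denom <- 1 + sum(exp(eta))

# Probabilities of the reference class and of each non-reference class
probs <- c(class1 = 1 / denom, exp(eta) / denom)

print(round(probs, 4))
print(sum(probs))
```

Because every class shares the same denominator, the probabilities are positive and sum to 1 by construction, and the class with the largest linear predictor gets the largest probability.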

For our purchase prediction problem, we will first fit a multinomial logistic model on our
data. The multinom() function from the nnet package will be used to estimate the logistic
equation for our multi-class problem (ProductChoice has four possible options). Once we get
the probabilities for each class, we will create a classifier to assign classes to
individual cases.

There will be two methods illustrated for the classifier:


Pick the highest probability: Pick the class having the highest
probability among all the possible classes. However, this
technique suffers from the class imbalance problem. The class imbalance
problem occurs when the prior distribution of a high-proportion class
drives the predicted probabilities, so the low-proportion
classes are never assigned when the class is chosen by the maximum
predicted probability.


Ratio of probabilities: We can take the ratio of the predicted
probabilities to the prior distribution and then choose the class with
the highest ratio. The highest ratio signifies that the model
picked up the strongest signal, as the probabilities are normalized by
the prior proportions.


Let's fit a model and apply the two classifiers. 


#Remove the data having NA. NA is ignored in modeling algorithms
Data_Purchase <- na.omit(Data_Purchase_Prediction)
rownames(Data_Purchase) <- NULL

#Random Sample for easy computation
Data_Purchase_Model <- Data_Purchase[sample(nrow(Data_Purchase), 10000), ]
print("The Distribution of product is as below")
[1] "The Distribution of product is as below"
table(Data_Purchase_Model$ProductChoice)

   1    2    3    4
2192 3883 2860 1065
#fit a multinomial logistic model
library(nnet)
mnl_model <- multinom(ProductChoice ~ MembershipPoints + IncomeClass +
CustomerPropensity + LastPurchaseDuration + CustomerAge + MartialStatus,
data = Data_Purchase)
# weights: 44 (30 variable)
initial value 672765.880864
iter 10 value 615285.850873
iter 20 value 607471.781374
iter 30 value 607231.472034
final value 604217.503433
converged
#Display the summary of model statistics
mnl_model
Call:
multinom(formula = ProductChoice ~ MembershipPoints + IncomeClass +
    CustomerPropensity + LastPurchaseDuration + CustomerAge +
    MartialStatus, data = Data_Purchase)


Coefficients: 
(Intercept) MembershipPoints IncomeClass CustomerPropensityLow 

2 0.77137077 -0.02940732 0.00127305 -0.3960318 

3 0.01775506 0.03340207 0.03540194 -0.8573716 

4 -1.15109893 -0.12366367 0.09016678 -0.6427954 
CustomerPropensityMedium CustomerPropensityUnknown 

2 -0.2745419 -0.5715016 

3 -0.4038433 -1.1824810 

4 -0.4035627 -0.9769569 
CustomerPropensityVeryHigh LastPurchaseDuration CustomerAge 

2 0.2553831 0.04117902 0.001638976 

3 0.5645137 0.05539173 0.005042405 

4 0.5897717 0.07047770 0.009664668 
MartialStatus 


2 -0.033879645 
3 -0.007461956 
4 0.122011042 


Residual Deviance: 1208435 
AIC: 1208495 
The model output shows that it converged after a little over 30 iterations. Now let's look
at a sample of the probabilities assigned by the model and then apply the first classifier,
which picks the class with the highest probability, to see how it classifies the cases.


#Predict the probabilities
predicted_test <- as.data.frame(predict(mnl_model, newdata = Data_Purchase,
type="probs"))

head(predicted_test)
1 2 3 4
1 0.21331014 0.3811085 0.3361570 0.06942438
2 0.05060546 0.2818905 0.4157159 0.25178812
3 0.21017415 0.4503171 0.2437507 0.09575798
4 0.24667443 0.4545797 0.2085789 0.09016690
5 0.09921814 0.3085913 0.4660605 0.12613007
6 0.11730147 0.3624635 0.4184053 0.10182971
#Do the prediction based on highest probability
test_result <- apply(predicted_test, 1, which.max)

result <- as.data.frame(cbind(Data_Purchase$ProductChoice, test_result))

colnames(result) <- c("Actual Class", "Predicted Class")
table(result$`Actual Class`, result$`Predicted Class`)

1 2 3
1 302 91952 12365
2 248 150429 38028
3 170 90944 51390
4 27 32645 16798


The classifier assigns almost every case to class 2 or 3, rarely predicts class 1, and
does not assign even a single case to class 4. This is happening because the classifier
(picking the highest probability) is very sensitive to absolute probabilities. This is the
class imbalance problem discussed at the start of the section.

Let's apply the second method we discussed at the start of the section, probability ratios,
to classify. We will select the class based on the ratio of the predicted probability to
the prior probability/proportion. This way, we ensure the classifier assigns the class that
provides the highest jump in probability. In other words, the ratio normalizes the
probabilities by the prior proportions, thereby reducing the bias due to prior distributions.
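The difference between the two classifiers is easiest to see on a tiny example. The sketch below uses a made-up probability matrix `probs` and prior vector `prior` (both hypothetical), and uses sweep() as one way to divide each class column by that class's prior before picking the maximum.

```r
# Hypothetical predicted probabilities: one row per case, one column per class
probs <- matrix(c(0.5, 0.3, 0.2,
                  0.7, 0.2, 0.1),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("1", "2", "3")))

# Hypothetical prior proportions of the three classes
prior <- c("1" = 0.6, "2" = 0.3, "3" = 0.1)

# Classifier 1: pick the class with the highest probability
by_max <- apply(probs, 1, which.max)

# Classifier 2: divide column j by prior[j], then pick the highest ratio
ratio    <- sweep(probs, 2, prior, "/")
by_ratio <- apply(ratio, 1, which.max)

print(rbind(by_max, by_ratio))
```

For the first case the highest-probability rule picks class 1, while the ratio rule picks class 3, because a probability of 0.2 is twice that class's prior of 0.1. The second case stays in class 1 under both rules.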


prior <- table(Data_Purchase_Model$ProductChoice)/nrow(Data_Purchase_Model)
prior_mat <- rep(prior, nrow(Data_Purchase_Model))
pred_ratio <- predicted_test/prior_mat

#Do the prediction based on highest ratio
test_result <- apply(pred_ratio, 1, which.max)
result <- as.data.frame(cbind(Data_Purchase$ProductChoice, test_result))

colnames(result) <- c("Actual Class", "Predicted Class")
table(result$`Actual Class`, result$`Predicted Class`)

1 2 3 4
1 21251 64410 18935 23
2 28087 112480 48078 60
3 13887 77090 51476 51
4 4620 27848 16958 44


Now you can see that the class imbalance problem is reduced to some extent. You are
encouraged to try other methods of sampling to reduce this problem further. Multinomial
models are very popular in multi-class classification problems; alternative
algorithms for multi-class classification tend to be more complex than multinomial models.
Multinomial logistic classifiers are more commonly used in natural language processing and
multi-class problems than Naive Bayes classifiers.


6.5.11 Generalized Linear Models 


Generalized linear models extend the idea of ordinary linear regression to other
distributions of response variables in the exponential family.

In the GLM framework, we assume that the dependent variable is generated from an
exponential family distribution. The exponential family includes the normal, binomial,
Poisson, and gamma distributions, among others. The expectation in that case is defined as:

$$ E(Y) = \mu = g^{-1}(X\beta) $$

where E(Y) is the expected value of Y; $X\beta$ is the linear predictor, a linear
combination of unknown parameters $\beta$; and g is the link function.

The model parameters, $\beta$, are typically estimated with maximum likelihood,
maximum quasi-likelihood, or Bayesian techniques.
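The role of the inverse link can be checked directly in R, because every family object carries its own `linkfun` and `linkinv` functions. The sketch below applies the inverse link of two common families to a made-up vector of linear predictor values `eta`.

```r
# Hypothetical values of the linear predictor X * beta
eta <- c(-2, 0, 2)

# Logit link (binomial family): the inverse link maps eta to a probability
mu_binomial <- binomial(link = "logit")$linkinv(eta)

# Log link (poisson family): the inverse link maps eta to a positive mean
mu_poisson <- poisson(link = "log")$linkinv(eta)

# Applying the link function again recovers the linear predictor
eta_back <- binomial(link = "logit")$linkfun(mu_binomial)

print(rbind(eta, mu_binomial, mu_poisson, eta_back))
```

A linear predictor of 0 maps to a probability of 0.5 under the logit link and to a mean of 1 under the log link, which is how the same linear form can serve response variables with very different ranges.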

The glm function is very generic function that can accommodate many types of 
distributions in a response variable: 


glm(formula, family=familytype(link=linkfunction), data=)


• binomial, (link = "logit")

  Binomial distribution is very common in the real world. Any
  problem that has two possible outcomes can be thought of as
  a binomial distribution. A simple example could be whether it
  will rain today (=1) or not (=0).


• gaussian, (link = "identity")

  Gaussian distribution is a continuous distribution, i.e., a normal
  distribution. All the problems in linear regression are modeled
  assuming Gaussian distribution on the dependent variable.


• Gamma, (link = "inverse")

  An example could be, "N people are waiting at a take-away.
  How long will it take to serve them?" or the time to failure of a
  machine in the industry.


• poisson, (link = "log")

  This is a common distribution in queuing examples. One
  example could be "How many calls will the call center receive
  today?"


The following families are also supported by the glm() function in R.
However, these distributions are not commonly observed in day-to-day activities.


inverse.gaussian, (link = "1/mu^2")

quasi, (link = "identity", variance = "constant")

quasibinomial, (link = "logit")

quasipoisson, (link = "log")
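
As a minimal sketch of the glm() call pattern described above, the following fits a Poisson model with a log link to simulated count data. The data, the coefficient values 0.5 and 0.8, and the variable names are illustrative assumptions and do not come from the book's datasets.

```r
# Simulated data (illustrative assumption): counts whose log-mean is linear in x
set.seed(42)
n <- 500
x <- runif(n, 0, 2)
mu <- exp(0.5 + 0.8 * x)          # inverse link g^{-1}(X beta) for the log link
y <- rpois(n, lambda = mu)
sim_data <- data.frame(x = x, y = y)

# Fit a GLM with the Poisson family and log link
fit <- glm(y ~ x, family = poisson(link = "log"), data = sim_data)

# The estimated coefficients should be close to the simulated 0.5 and 0.8
print(coef(fit))

# Predictions on the response scale apply the inverse link automatically
new_x <- data.frame(x = c(0.5, 1.5))
print(predict(fit, newdata = new_x, type = "response"))
```

Swapping the family argument (for example, binomial(link = "logit") for a 0/1 response) changes the assumed distribution while the rest of the call stays the same.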


6.5.12 Conclusion 


Regression is one of the very first learning algorithms, with a heavy influence from statistics
but an elegantly simple design. Over the years, the complexity and diversity of regression
techniques have increased manyfold as new applications emerged. In this book,
we gave a large share of pages to regression in order to bring out the best of the widely
used regression techniques. We covered everything from the most fundamental simple regression
to the more advanced polynomial regression, with a heavy emphasis on demonstration in R.
Interested readers are advised to refer to an advanced text on the topic if
they want to go deeper into regression theory.

We have also presented a detailed discussion of model diagnostics for regression,
which is the most overlooked topic when developing real-world models and can cause
monumental damage in the industry where the model is applied if it is not done properly.

In the next section, we will cover a distance-based algorithm
called Support Vector Machine, which can be a really good binary classification model
on higher-dimensional datasets.


6.6 Support Vector Machine (SVM)


In the documentation of the R interface to libsvm, titled Support Vector Machine, David Meyer
gives a crisp summary of SVM, describing class separation, the handling of overlapping classes,
dealing with nonlinearity, and the modeling of the problem solution. The following are
excerpts from the documentation:


a. Class separation 


SVM looks for the optimal hyperplane (in two dimensions, a hyperplane is a line, and
in a p-dimensional space, a hyperplane is a flat affine subspace of dimension
p - 1) separating the two classes by maximizing the margin between the closest points
of the two classes (see Figure 6-33). In two-dimensional space, as shown in Figure 6-33,
the points lying on the margins are called support vectors, and the line passing through the
midpoint of the margins is the optimal hyperplane.

In two dimensions, the margin hyperplanes for linearly separable data are represented by
the following two equations:


$$w \cdot x + b = 1$$
and
$$w \cdot x + b = -1$$


subject to the following constraint, so that each observation lies on the correct side of
the margin:


$$y_i (w \cdot x_i + b) \ge 1, \quad \text{for all } 1 \le i \le n.$$


b. Overlapping classes 


If data points reside on the wrong side of the discriminant margin, they can be
weighted down to reduce their influence (in this setting, the margin is called a soft margin).

The following function, called the hinge loss function, can be introduced to handle
this situation:


$$\max(0, 1 - y_i (w \cdot x_i + b))$$


which becomes 0 if $x_i$ lies on the correct side of the margin; otherwise, the function value is
proportional to the distance from the margin.

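
The hinge loss can be sketched in a few lines of R. The separator (w = (1, 1), b = -3) and the four observations below are illustrative assumptions chosen so that each case (correct side, inside the margin, wrong side) appears once; labels are assumed to be coded as -1 and +1.

```r
# Hinge loss for a linear classifier; labels y must be coded as -1 or +1
hinge_loss <- function(X, y, w, b) {
  margin <- y * (as.vector(X %*% w) + b)
  pmax(0, 1 - margin)
}

# Illustrative (assumed) separator w.x + b = 0 with w = (1, 1), b = -3
w <- c(1, 1)
b <- -3
X <- rbind(c(3, 2),    # y = +1, well on the correct side: w.x + b = 2
           c(2, 1.5),  # y = +1, inside the margin:        w.x + b = 0.5
           c(1, 1),    # y = +1, on the wrong side:        w.x + b = -1
           c(0, 1))    # y = -1, well on the correct side: w.x + b = -2
y <- c(1, 1, 1, -1)

loss <- hinge_loss(X, y, w, b)
print(loss)
```

Observations beyond the margin on the correct side contribute zero loss, while the loss grows linearly the further an observation sits on the wrong side of its margin.
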
c. Nonlinearity 


If a linear separator couldn't be found, observations are usually projected into a 
higher-dimensional space using a kernel function where the observations effectively 
become linearly separable. 

One popular kernel from the Gaussian family is the radial basis function. A radial basis
function (RBF) is a real-valued function whose value depends only on the distance from
the origin. The function can be defined as:


$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$


where $\|x - y\|^2$ is the squared Euclidean distance between the observations x and
y, and $\sigma$ is a free parameter. Other kernels, such as linear, polynomial, and sigmoidal,
can also be used depending on the data.
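
A short sketch of the Gaussian RBF kernel follows, assuming the common parameterization with a width parameter sigma. The function names rbf_kernel and rbf_gram and the three sample points are hypothetical and used only for illustration.

```r
# Gaussian RBF kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
rbf_kernel <- function(x, y, sigma = 1) {
  exp(-sum((x - y)^2) / (2 * sigma^2))
}

# Kernel (Gram) matrix for the rows of X
rbf_gram <- function(X, sigma = 1) {
  sq_dist <- as.matrix(dist(X))^2       # pairwise squared Euclidean distances
  exp(-sq_dist / (2 * sigma^2))
}

X <- rbind(c(0, 0), c(1, 0), c(3, 4))
K <- rbf_gram(X, sigma = 1)
print(round(K, 4))
```

In the e1071 package, the radial kernel is written as exp(-gamma * |u - v|^2), so the gamma argument of svm() corresponds to 1 / (2 * sigma^2) in this notation.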

A program that can perform all these tasks is called a support vector machine.

Figure 6-33. Classification using support vector machine


6.6.1 Linear SVM 


The problem could be formulated as a quadratic optimization problem which can be 
solved by many known techniques. The following expressions could be converted to a 
quadratic function (the discussion of this topic is beyond the scope of this book). 


6.6.1.1 Hard Margins 


For the hard margins:


Minimize $\|w\|$, subject to $y_i (w \cdot x_i + b) \ge 1$, for $i = 1, \ldots, n$


6.6.1.2 Soft Margins 


For the soft margins:


Minimize $\left[\frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i (w \cdot x_i + b))\right] + \lambda \|w\|^2$


where the parameter $\lambda$ determines the tradeoff between increasing the margin size and
ensuring that $x_i$ lies on the correct side of the margin.


6.6.2 Binary SVM Classifier


Let’s look at the classification of benign and malignant cells in our breast cancer dataset. 
We want to create a binary classifier to classify cells into benign and malignant. 


a. Data summary 


library(e1071)
library(rpart)

breast_cancer_data <- read.table("Dataset/breast-cancer-wisconsin.data.txt", sep = ",")
breast_cancer_data$V11 <- as.factor(breast_cancer_data$V11)

summary(breast_cancer_data)


       V1                 V2               V3               V4
 Min.   :   61634   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000
 1st Qu.:  870688   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000
 Median : 1171710   Median : 4.000   Median : 1.000   Median : 1.000
 Mean   : 1071704   Mean   : 4.418   Mean   : 3.134   Mean   : 3.207
 3rd Qu.: 1238298   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000
 Max.   :13454352   Max.   :10.000   Max.   :10.000   Max.   :10.000
       V5               V6               V7            V8
 Min.   : 1.000   Min.   : 1.000   1      :402   Min.   : 1.000
 1st Qu.: 1.000   1st Qu.: 2.000   10     :132   1st Qu.: 2.000
 Median : 1.000   Median : 2.000   2      : 30   Median : 3.000
 Mean   : 2.807   Mean   : 3.216   5      : 30   Mean   : 3.438
 3rd Qu.: 4.000   3rd Qu.: 4.000   3      : 28   3rd Qu.: 5.000
 Max.   :10.000   Max.   :10.000   8      : 21   Max.   :10.000
                                   (Other): 56
       V9               V10          V11
 Min.   : 1.000   Min.   : 1.000   2:458
 1st Qu.: 1.000   1st Qu.: 1.000   4:241
 Median : 1.000   Median : 1.000
 Mean   : 2.867   Mean   : 1
 3rd Qu.: 4.000   3rd Qu.: 1
 Max.   :10.000   Max.   :10.000


b. Data preparation 


# Split data into a train and test set
index <- 1:nrow(breast_cancer_data)
test_data_index <- sample(index, trunc(length(index) / 3))
test_data <- breast_cancer_data[test_data_index, ]
train_data <- breast_cancer_data[-test_data_index, ]


c. Model building


svm.model <- svm(V11 ~ ., data = train_data, cost = 100, gamma = 1)


d. Model evaluation


Normally, such a high level of accuracy is only possible
if the features being used and the data match the real world
very closely. Such a dataset is difficult to build in practical
scenarios. However, in the world of medical diagnostics,
the expectation in terms of accuracy is always very high, as an error
involves a significant risk to somebody's life.


Training set accuracy = 100%


library(gmodels)
svm_pred_train <- predict(svm.model, train_data[, -11])
CrossTable(train_data$V11, svm_pred_train,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


   Cell Contents
|-------------------------|
|                       N |
|-------------------------|


Total Observations in Table:  466


               | predicted default
actual default |         2 |         4 | Row Total |
---------------|-----------|-----------|-----------|
             2 |       303 |         0 |       303 |
               |     0.650 |     0.000 |           |
---------------|-----------|-----------|-----------|
             4 |         0 |       163 |       163 |
               |     0.000 |     0.350 |           |
---------------|-----------|-----------|-----------|
  Column Total |       303 |       163 |       466 |
---------------|-----------|-----------|-----------|


Testing set accuracy = 94%


svm_pred_test <- predict(svm.model, test_data[, -11])
CrossTable(test_data$V11, svm_pred_test,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


   Cell Contents
|-------------------------|
|                       N |
|-------------------------|


Total Observations in Table:  233


               | predicted default
actual default |         2 |         4 | Row Total |
---------------|-----------|-----------|-----------|
             2 |       142 |        13 |       155 |
               |     0.609 |     0.056 |           |
---------------|-----------|-----------|-----------|
             4 |         0 |        78 |        78 |
               |     0.000 |     0.335 |           |
---------------|-----------|-----------|-----------|
  Column Total |       142 |        91 |       233 |
---------------|-----------|-----------|-----------|


The binary SVM has done exceptionally well on the breast cancer dataset, in line with
the benchmark for this dataset described in the UCI Machine Learning Repository.
The classification matrix shows a correct classification rate of about 94% (60.9% of the test cells
are correctly identified as benign and 33.5% as malignant).
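
Since errors in medical diagnostics are not equally costly, it can help to look beyond overall accuracy. The sketch below recomputes accuracy, sensitivity, and specificity from the test-set counts shown in the table above; treating class 4 (malignant in the UCI coding, with class 2 benign) as the positive class is an assumption of this example.

```r
# Test-set counts copied from the CrossTable output (rows = actual, columns = predicted)
conf_mat <- matrix(c(142, 13,
                       0, 78),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("2", "4"), predicted = c("2", "4")))

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["4", "4"] / sum(conf_mat["4", ])  # malignant cases correctly flagged
specificity <- conf_mat["2", "2"] / sum(conf_mat["2", ])  # benign cases correctly cleared

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3))
```

On this particular split, no malignant case is missed; all 13 errors are benign cells predicted as malignant.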


6.6.3 Multi-Class SVM 


We introduced SVM as a binary classifier. However, the idea of SVM can be extended to
multi-class classification problems as well. SVM can be used as a multi-class
classifier by creating multiple binary classifiers. This method works similarly to the idea
of multinomial logistic regression, where we build a logistic model for each class paired
with the base class.

Along the same lines, we can create a set of binary SVMs to do multi-class
classification. The steps in implementing that are as follows:


1. Create binary classifiers using one of the following schemes:
   • Between one class and the rest of the classes
   • Between every pair of classes (all possible pairs)


2. For any new case, the SVM classifier adopts a winner-takes-all
   strategy, in which the class with the highest output is assigned.


To implement this methodology, it is important that the output functions are 
calibrated to generate comparable scores, otherwise the classification will become 
biased. 
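
The winner-takes-all step can be sketched as follows. The score matrix is hypothetical: it stands in for the (comparably calibrated) outputs of three one-versus-rest binary classifiers, and the function name winner_takes_all is an assumption made for this illustration.

```r
# Hypothetical decision scores from three one-vs-rest classifiers
# (rows = new cases, columns = the class each binary classifier separates from the rest)
scores <- matrix(c( 1.2, -0.4, -0.9,
                   -0.3,  0.1,  0.8,
                   -1.1,  0.6, -0.2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("High", "Low", "Medium")))

# Winner-takes-all: assign each case to the class with the highest score
winner_takes_all <- function(scores) {
  colnames(scores)[apply(scores, 1, which.max)]
}

predicted <- winner_takes_all(scores)
print(predicted)
```

Note that the svm() function in e1071 handles multi-class problems internally with the one-against-one scheme and a voting rule, so this step does not need to be coded by hand when using that package.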

There are other methods for multi-class SVMs as well. One such method was proposed by
Crammer and Singer; it casts the multi-class
classification problem into a single optimization problem, rather than decomposing
it into multiple binary classification problems.

We will show a quick example with our house worth data. The house net worth is
divided into three classes: high, medium, and low. The multi-class SVM has to classify
each house into one of these categories. Here is the R implementation of the SVM multi-class
classifier:


# Read the House Worth data
Data_House_Worth <- read.csv("Dataset/House Worth Data.csv", header = TRUE)


library('e1071')
# Fit a multi-class SVM
svm_multi_model <- svm(HouseNetWorth ~ StoreArea + LawnArea, Data_House_Worth)


# Display the model
svm_multi_model


Call:
svm(formula = HouseNetWorth ~ StoreArea + LawnArea, data = Data_House_Worth)


Parameters: 
SVM-Type: C-classification 
SVM-Kernel: radial 
cost: 1 
gamma: 0.5 


Number of Support Vectors: 120 


# Get the predicted value for the whole set
res <- predict(svm_multi_model, newdata = Data_House_Worth)

# Classification matrix
table(Data_House_Worth$HouseNetWorth, res)


        res
         High Low Medium
  High    122   1      7
  Low       6 122      7
  Medium    1   7     43


# Classification rate
sum(diag(table(Data_House_Worth$HouseNetWorth, res))) / nrow(Data_House_Worth)
[1] 0.9082278


Multi-class SVM gives us a classification rate of roughly 91% on the house worth data. The
prediction is good across all the classes.
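
To check how the prediction holds up class by class, the per-class rate can be read off the classification matrix. This sketch re-enters the counts from the table above by hand; keep in mind that these figures are computed on the same data used to fit the model, so they are likely optimistic.

```r
# Counts copied from the classification matrix (rows = actual, columns = predicted)
conf_mat <- matrix(c(122,   1,  7,
                       6, 122,  7,
                       1,   7, 43),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = c("High", "Low", "Medium"),
                                   predicted = c("High", "Low", "Medium")))

overall_rate <- sum(diag(conf_mat)) / sum(conf_mat)
per_class_recall <- diag(conf_mat) / rowSums(conf_mat)  # share of each actual class predicted correctly

print(round(overall_rate, 4))
print(round(per_class_recall, 3))
```

The Medium class, which has the fewest houses, is recovered somewhat less often than High and Low.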


6.6.4 Conclusion 


Support vector machine, which was initially a non-probabilistic binary classifier with
later variations to solve multi-class problems as well, has proved to be one of the
most successful algorithms in machine learning. A number of applications of SVM
have emerged over the years; a few noteworthy ones are hypertext categorization, image
classification, character recognition, and many applications in the biological sciences.

This section gave a brief introduction to SVM, with both binary and multi-class
versions demonstrated on the Breast Cancer and House Worth data. In the next section, we discuss the
decision tree algorithm, which is another model for classification as well as regression
and a very popular approach in many fields of study.


6.7 Decision Trees 


Unlike other ML algorithms based on statistical techniques, a decision tree is a non-parametric
model, having no underlying assumptions for the model. However, we should
be careful in identifying the problems where a decision tree is appropriate and where
it is not. The decision tree's ease of interpretation and understanding has led to its use in
many applications, ranging from agriculture, where you could predict the chances of rain
given the various environmental variables, to software development, where it's possible
to estimate the development effort given the details about the modules. Over the years,
tree-based approaches have evolved to become much broader in applicability as well
as sophistication. They are available for both discrete and continuous response
variables, which makes them a suitable solution for both classification and regression
problems.

More formally, a decision tree D consists of two types of nodes:


• A leaf node, which indicates the class/region defined by the
  response variable.


• A decision node, which specifies some test on a single attribute
  (predictor variable), with one branch and subtree for each
  possible outcome of the test.


A decision tree, once constructed, can be used to classify an observation by starting at
the top decision node (called the root node) and moving down through the other decision
nodes until a leaf is encountered, using a recursive divide-and-conquer approach. Before
we get into the details of how the algorithm works, let's get familiar with certain
measures and their importance for the decision tree building process.
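
The traversal from the root node down to a leaf can be sketched with a small hand-built tree stored as nested lists. The second test (y < 0.60) and the class labels are illustrative assumptions; only the root test x < 0.43 echoes the figure that follows.

```r
# A tiny hand-built tree as nested lists (thresholds and classes are illustrative assumptions)
tree <- list(
  attribute = "x", threshold = 0.43,
  yes = list(class = "triangle"),                 # leaf reached when x < 0.43
  no  = list(attribute = "y", threshold = 0.60,   # decision node testing y
             yes = list(class = "circle"),
             no  = list(class = "triangle"))
)

# Start at the root and follow one branch per test until a leaf is reached
classify <- function(node, obs) {
  if (!is.null(node$class)) return(node$class)    # leaf node: return its class
  branch <- if (obs[[node$attribute]] < node$threshold) "yes" else "no"
  classify(node[[branch]], obs)
}

print(classify(tree, list(x = 0.30, y = 0.90)))
print(classify(tree, list(x = 0.70, y = 0.20)))
print(classify(tree, list(x = 0.70, y = 0.80)))
```

Each call tests a single attribute at a decision node, which is why a prediction needs at most as many comparisons as the depth of the tree.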


Figure 6-34. Decision tree with two attributes and a class (root node test: x < 0.43?, with Yes and No branches)


6.7.1 Types of Decision Trees 


Decision trees come in two types, one for regression and the other for
classification problems. This means you can use decision trees for categorical as well
as continuous response variables, which makes them a widely popular approach in the
ML world. Next, we briefly describe how the two problems can be modeled using a
decision tree.


6.7.1.1 Regression Trees


In problems where the response variable is continuous, regression trees are useful.
They provide the same level of interpretability as linear regression and, on top of that,
they offer a very intuitive understanding of the final output, which ties back to the domain
of the problem. In the previous example in Figure 6-34, the regression tree is built
to recursively split the two-feature vector space into different regions based on various
thresholds. Our objective in splitting the regions in every iteration is to minimize the
Residual Sum of Squares (RSS) defined by the following equation:


$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$


where $R_1(j,s)$ and $R_2(j,s)$ are the two regions produced by splitting feature j at
threshold s, and $\hat{y}_{R_1}$ and $\hat{y}_{R_2}$ are the mean responses of the training
observations in each region.


Overall, the following are the two steps involved in regression tree building and
prediction on new test data:


• Recursively split the feature vector space ($X_1, X_2, \ldots, X_p$) into
  distinct and non-overlapping regions.


• For new observations falling into the same region, the prediction
  is equal to the mean of all the training observations in that region.


In an n-dimensional feature space, the Gini-index or entropy, which measure the classification
power of a node, could be used to choose the right feature for splitting the space into different
regions. Variance reduction (not covered in this book) is another popular approach, which
appropriately discretizes the range of the continuous response variable in order to choose the
right thresholds for splitting.
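
A single RSS-minimizing split on one feature can be sketched as below. This is a simplified illustration of one step of the recursive procedure, not the CART implementation used later; the helper names rss and best_split and the toy data are assumptions.

```r
# Residual sum of squares of a region when predicting its mean
rss <- function(y) sum((y - mean(y))^2)

# Scan candidate thresholds s on one feature x and return the RSS-minimizing split
best_split <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  total_rss <- sapply(candidates, function(s) rss(y[x < s]) + rss(y[x >= s]))
  best <- which.min(total_rss)
  list(threshold = candidates[best], rss = total_rss[best])
}

# Toy data (illustrative assumption): the response jumps when x crosses 5
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(10, 11, 9, 10, 20, 21, 19, 20)

result <- best_split(x, y)
print(result)

# Predictions for each region are the region means
print(c(left = mean(y[x < result$threshold]), right = mean(y[x >= result$threshold])))
```

A full regression tree would repeat this search over every feature and then recurse into each of the two resulting regions.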


Figure 6-35. Recursive binary split and its corresponding tree for two-dimensional feature
space

We will take up a demonstration using a regression tree algorithm called Classification
and Regression Tree (CART) later in this chapter, on the housing dataset described earlier
in this chapter.


6.7.1.2 Classification Tree 


A classification tree is more suitable for categorical or discrete response variables. The
following are the key differences between classification trees and regression trees:


• We use the classification error rate for making the splits in
  classification trees.


• Instead of taking the mean of the response variable in a particular
  region for prediction, here we use the most commonly occurring
  class of training observations in that region as the prediction.


Again, we could use the Gini-index or entropy as a measure for selecting the best feature
or attribute for splitting the observations into different classes.

In the coming sections, we will discuss some popular classification decision tree
algorithms like ID3 and C5.0.


6.7.2 Decision Measures 


There are certain measures that are key to building a decision tree. In this section,
we will discuss a few measures associated with node purity (a measure of randomness or
heterogeneity). In the context of a decision tree, a small value signifies that the node contains
a majority of observations from a single class. There are two widely used measures of
node purity, the Gini-index and entropy, and another measure called information gain, which
uses either the Gini-index or entropy to decide on a node split.


6.7.2.1 Gini Index 


The Gini-index is calculated using


$$G = \sum_{k=1}^{K} p_{mk} (1 - p_{mk})$$


where $p_{mk}$ is the proportion of training observations in the m-th region that are from
the k-th class.

For demonstration purposes, look at Figure 6-36. Suppose you have two classes,
where a proportion P1 of all training observations belongs to class C1 (triangle) and P2
= 1 - P1 belongs to class C2 (circle). The Gini-index then follows the curve shown here:


curve(x * (1 - x) + (1 - x) * x, xlab = "P", ylab = "Gini-Index", lwd = 5)


Figure 6-36. Gini-Index function


6.7.2.2 Entropy 


Entropy is calculated using


$$E = -\sum_{k=1}^{K} p_{mk} \log_2(p_{mk})$$


The curve for entropy looks something like this:


curve(-x * log2(x) - (1 - x) * log2(1 - x), xlab = "x", ylab = "Entropy", lwd = 5)


Figure 6-37. Entropy function

Observe that both measures are very similar; however, there are some differences:


• The Gini-index is more suitable for continuous attributes, and entropy
  for discrete data.


• The Gini-index works well for minimizing misclassifications.


• Entropy is slightly slower to compute than the Gini-index, as it involves
  logarithms (although this doesn't really matter much given
  today's fast computing machines).
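
Both measures can be computed directly from the class labels in a node. The sketch below uses hypothetical helper names (class_props, gini_index, entropy) and three made-up nodes to show that both measures are zero for a pure node and largest for an evenly mixed one.

```r
# Class proportions in a node, dropping empty classes so that 0 * log2(0) is treated as 0
class_props <- function(labels) {
  p <- as.vector(table(labels)) / length(labels)
  p[p > 0]
}

gini_index <- function(labels) { p <- class_props(labels); sum(p * (1 - p)) }
entropy    <- function(labels) { p <- class_props(labels); -sum(p * log2(p)) }

# Three illustrative nodes (assumed data)
pure_node   <- c("circle", "circle", "circle", "circle")
mixed_node  <- c("circle", "circle", "triangle", "triangle")
skewed_node <- c("circle", "circle", "circle", "triangle")

print(c(gini = gini_index(pure_node),   entropy = entropy(pure_node)))
print(c(gini = gini_index(mixed_node),  entropy = entropy(mixed_node)))
print(c(gini = gini_index(skewed_node), entropy = entropy(skewed_node)))
```

For two classes, the Gini-index peaks at 0.5 and entropy at 1, both at an even 50/50 mix, which matches the shape of the two curves plotted above.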


6.7.2.3 Information Gain 


Information gain is a measure that quantifies the change in entropy before and
after the split. It's an elegantly simple measure for deciding the relevance of an attribute. In
general, we can write information gain as:


$$IG = \left| G(\text{Parent}) - \text{Average}(G(\text{Children})) \right|$$


where G(Parent) is the Gini-index (we could use entropy as well) of the parent node
represented by an attribute before the split, and G(Children) is the Gini-index of the child
nodes that will be generated after the split. For example, in Figure 6-34, all observations
satisfying the parent node condition x < 0.43 go to its left child node and the remaining ones to its right.
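
The following sketch evaluates the information gain of a threshold split using the Gini-index. It assumes that the average over the children is weighted by the number of observations in each child, which is the usual convention but is not stated explicitly above; the data and function names are illustrative.

```r
gini_index <- function(labels) {
  p <- as.vector(table(labels)) / length(labels)
  sum(p * (1 - p))
}

# Gain from splitting on x < threshold; children are averaged with weights
# proportional to their size (an assumption about how "Average" is taken)
information_gain <- function(x, labels, threshold) {
  left  <- labels[x < threshold]
  right <- labels[x >= threshold]
  children <- (length(left) * gini_index(left) +
               length(right) * gini_index(right)) / length(labels)
  abs(gini_index(labels) - children)
}

# Toy data (illustrative assumption)
x      <- c(0.10, 0.20, 0.35, 0.40, 0.50, 0.60, 0.80, 0.90)
labels <- c("triangle", "triangle", "triangle", "circle",
            "circle", "circle", "circle", "circle")

print(information_gain(x, labels, threshold = 0.43))
print(information_gain(x, labels, threshold = 0.38))
```

On this toy data, the threshold 0.38 separates the classes perfectly and therefore yields the larger gain, so a tree-building algorithm would prefer it over 0.43.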


6.7.3 Decision Tree Learning Methods 


In this section, we discuss four widely used decision tree algorithms applied to our
real-world datasets.


a. Data Summary 


library(C50) 
library(splitstackshape) 
library(rattle) 
library(rpart.plot) 
library(data.table) 


Data_Purchase <- fread("/Dataset/Purchase Prediction Dataset.csv", header = T,
                       verbose = FALSE, showProgress = FALSE)
str(Data_Purchase)


Classes 'data.table' and 'data.frame': 500000 obs. of 12 variables:
 $ CUSTOMER_ID         : chr "000001" "000002" "000003" "000004" ...
 $ ProductChoice       : int ...
 $ MembershipPoints    : int 6 2 4 2 6 6 5 9 5 3 ...
 $ ModeOfPayment       : chr "MoneyWallet" "CreditCard" "MoneyWallet" "MoneyWallet" ...
 $ ResidentCity        : chr "Madurai" "Kolkata" "Vijayawada" "Meerut" ...
 $ PurchaseTenure      : int 441063313198...
 $ Channel             : chr "Online" "Online" "Online" "Online" ...
 $ IncomeClass         : chr ...
 $ CustomerPropensity  : chr "Medium" "VeryHigh" "Unknown" "Low" ...
 $ CustomerAge         : int 55 75 34 26 38 71 72 27 33 29 ...
 $ MartialStatus       : int 0 0 0 0 1 0 0 0 0 1 ...
 $ LastPurchaseDuration: int 4 1515661054156...
 - attr(*, ".internal.selfref")=<externalptr>

# Check the distribution of data before grouping
table(Data_Purchase$ProductChoice)


     1      2      3      4
106603 199286 143893  50218

b. Data Preparation 


# Pulling out only the data relevant to this chapter
Data_Purchase <- Data_Purchase[, .(CUSTOMER_ID, ProductChoice, MembershipPoints,
                                   IncomeClass, CustomerPropensity, LastPurchaseDuration)]


# Delete NA from subset
Data_Purchase <- na.omit(Data_Purchase)
Data_Purchase$CUSTOMER_ID <- as.character(Data_Purchase$CUSTOMER_ID)


# Stratified sampling
Data_Purchase_Model <- stratified(Data_Purchase, group = c("ProductChoice"),
                                  size = 10000, replace = FALSE)


print("The Distribution of equal classes is as below")
[1] "The Distribution of equal classes is as below"
table(Data_Purchase_Model$ProductChoice)


    1     2     3     4
10000 10000 10000 10000

Data_Purchase_Model$ProductChoice <- as.factor(Data_Purchase_Model$ProductChoice)
Data_Purchase_Model$IncomeClass <- as.factor(Data_Purchase_Model$IncomeClass)
Data_Purchase_Model$CustomerPropensity <- as.factor(Data_Purchase_Model$CustomerPropensity)


# Build the decision tree on the train data (Set_1); the test data (Set_2)
# will be used for performance testing
set.seed(917)

# Note: the training rows are sampled with replacement
train <- Data_Purchase_Model[sample(nrow(Data_Purchase_Model),
                                    size = nrow(Data_Purchase_Model) * 0.7,
                                    replace = TRUE, prob = NULL), ]

train <- as.data.frame(train)


# Test set: all customers that were not drawn into the training sample
test <- Data_Purchase_Model[!(Data_Purchase_Model$CUSTOMER_ID %in% train$CUSTOMER_ID), ]


6.7.3.1 Iterative Dichotomizer 3 


J. Ross Quinlan, a computer science researcher in data mining and decision theory,
invented the most popular decision tree algorithms, C4.5 and ID3. Here is a brief description of how
the ID3 algorithm works:


1. Calculate the entropy for each attribute using the training
   observations.


2. Split the observations into subsets using the attribute with
   minimum entropy or maximum information gain.


3. The selected attribute becomes the decision node.


4. Repeat the process with the remaining attributes on each subset.


For demonstration, we will use an R package called RWeka, which is a wrapper built
on Weka, a collection of machine learning algorithms for data mining
tasks written in Java, containing tools for data pre-processing, classification, regression,
clustering, association rules, and visualization. The package RWeka contains the interface
code, and the Weka jar is in a separate package called RWekajars. For more information
on Weka, see http://www.cs.waikato.ac.nz/ml/weka/.


Note: Before using the ID3 function from the RWeka package, follow these instructions.


1. Install the package RWeka.


2. Set the environment variable WEKA_HOME to a folder location
   on your drive (e.g., D:\home) where you have sufficient access
   rights.


3. In the R console, run these two commands:


   WPM("refresh-cache")
   # looks for a package providing id3
   WPM("install-package", "simpleEducationalLearningSchemes")
   # load the package


a. Model Building
library(RWeka)


WPM("refresh-cache")
WPM("install-package", "simpleEducationalLearningSchemes")


# make classifier
ID3 <- make_Weka_classifier("weka/classifiers/trees/Id3")


ID3Model <- ID3(ProductChoice ~ CustomerPropensity + IncomeClass,
                data = train)


summary(ID3Model)

=== Summary === 

Correctly Classified Instances 9423 33.6536 % 
Incorrectly Classified Instances 18577 66.3464 % 
Kappa statistic 0.1148 

Mean absolute error 0.3634 

Root mean squared error 0.4263 

Relative absolute error 96.9041 % 

Root relative squared error 98.4399 % 

Total Number of Instances 28000 


=== Confusion Matrix === 


a b c d <-- classified as 


4061 987 929 1078 | a=1 
3142 1054 1217 1603 | b = 2 
2127 727 1761 2290 | c=3 
2206 859 1412 2547 | d=4 


b. Model Evaluation 


The training set accuracy is reported as part of the ID3 model output (33% correctly
classified instances), so we needn't compute it again here. Let's look at the testing set
accuracy.

Testing set accuracy = 32%

As you can observe, the accuracy is not exceptionally good. Moreover, given that there
are four classes, accuracy tends to be lower than in a binary problem; random guessing
among four balanced classes would be right only about 25% of the time. The fact that the training and
testing accuracy are almost equal tells us that there is no overfitting in this
scenario.


library(gmodels)

purchase_pred_test <- predict(ID3Model, test)
CrossTable(test$ProductChoice, purchase_pred_test,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


   Cell Contents
|-------------------------|
|                       N |
|-------------------------|


Total Observations in Table:  20002


               | predicted default
actual default |         1 |         2 |         3 |         4 | Row Total |
---------------|-----------|-----------|-----------|-----------|-----------|
             1 |      2849 |       758 |       625 |       766 |      4998 |
               |     0.142 |     0.038 |     0.031 |     0.038 |           |
---------------|-----------|-----------|-----------|-----------|-----------|
             2 |      2291 |       681 |       872 |      1151 |      4995 |
               |     0.115 |     0.034 |     0.044 |     0.058 |           |
---------------|-----------|-----------|-----------|-----------|-----------|
             3 |      1580 |       545 |      1151 |      1759 |      5035 |
               |     0.079 |     0.027 |     0.058 |     0.088 |           |
---------------|-----------|-----------|-----------|-----------|-----------|
             4 |      1594 |       590 |      1066 |      1724 |      4974 |
               |     0.080 |     0.029 |     0.053 |     0.086 |           |
---------------|-----------|-----------|-----------|-----------|-----------|
  Column Total |      8314 |      2574 |      3714 |      5400 |     20002 |
---------------|-----------|-----------|-----------|-----------|-----------|

The accuracy isn't very impressive with the ID3 algorithm. One possible reason could
be the multiple classes in our response variable. Generally, ID3 is not known for
performing exceptionally well on multi-class problems.


6.7.3.2 C5.0 Algorithm


In his book, C4.5: Programs for Machine Learning, J. Ross Quinlan [3] laid down a set of
key requirements for using these algorithms for classification tasks, which are as follows:


• Attribute-value description: All information about one case should
  be expressible in terms of a fixed collection of attributes (features),
  and it should not vary from one case to another.


• Predefined classes: As happens in any supervised learning
  approach, the categories to which cases are to be assigned must
  be predefined.


• Discrete classes: The classes must be sharply delineated; a case
  either does or does not belong to a particular class, and there
  must be far more cases than classes. So, clearly, problems
  with a continuous response variable are not the right fit for this
  algorithm.


• Sufficient data: The amount of data required is affected by factors
  such as the number of attributes and classes. As these increase, more
  data will be needed to construct a reliable model.


• Logical classification models: The description of the class should
  be logical expressions whose primitives are statements about the
  values of particular attributes. For example, IF Outlook = "sunny"
  AND Windy = "false" THEN class = "Play".


Here we will use the C5.0 algorithm, which is an extension of C4.5, for building the
decision tree on our purchase prediction dataset. C4.5 was a collective name given to a set
of computer programs that construct classification models. Some of the
new features in C5.0 are illustrated on Ross Quinlan's web page
(http://www.rulequest.com/see5-comparison.html).

In the classic book, Experiments in Induction, Hunt et al. [4] described many
implementations of concept learning systems. Here is how Hunt's approach works.

Given a set T of training observations having classes $C_1, C_2, \ldots, C_k$, at a broad level,
the following are the three possibilities involved in building the tree:


1. All the observations of T belong to a single class $C_j$. The
   decision tree D for T is a leaf identifying class $C_j$.


2. T contains no observations. C5.0 uses the most frequent class at the
   parent of this node.


3. T contains observations that belong to a mixture of classes. A node
   condition (test) is chosen based on a single attribute (the
   attribute is chosen based on information gain), which
   generates a partition of T into subsets $T_1, T_2, \ldots, T_n$.


The split in the third possibility is recursive in the algorithm; it is repeated until
either all the observations are correctly classified or the algorithm runs out of attributes to
split on. Since this divide-and-conquer method is a greedy approach that looks only at the immediate
step when deciding on a split, it's possible to end up overfitting. In
order to avoid this, a technique called pruning is used, which reduces the overfit so that the tree
generalizes better to unseen data. Fortunately, you don't have to worry about pruning,
since the C5.0 algorithm, after building the decision tree, iterates back and replaces the
branches that do not increase the information gain.
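
The three possibilities in Hunt's approach map naturally onto a recursive function. The sketch below is a simplified illustration for categorical attributes only; it is not the C5.0 implementation (there is no pruning and no handling of continuous attributes), and the function names and toy data are assumptions.

```r
entropy <- function(y) {
  p <- as.vector(table(y)) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Entropy of the parent minus the size-weighted entropy of the subsets
info_gain <- function(data, attribute, target) {
  subsets <- split(data[[target]], data[[attribute]], drop = TRUE)
  weighted <- sum(sapply(subsets, function(y) length(y) / nrow(data) * entropy(y)))
  entropy(data[[target]]) - weighted
}

majority_class <- function(y) names(which.max(table(y)))

build_tree <- function(data, target, attrs, parent_class = NA) {
  # Possibility 2: no observations reach this node, so use the parent's majority class
  if (nrow(data) == 0) return(list(class = parent_class))
  y <- data[[target]]
  # Possibility 1: a single class remains (or no attribute is left), so make a leaf
  if (length(unique(y)) == 1 || length(attrs) == 0) return(list(class = majority_class(y)))
  # Possibility 3: mixed classes, so split on the attribute with the highest information gain
  gains <- sapply(attrs, function(a) info_gain(data, a, target))
  best <- attrs[which.max(gains)]
  children <- lapply(split(data, data[[best]]), build_tree,
                     target = target, attrs = setdiff(attrs, best),
                     parent_class = majority_class(y))
  list(attribute = best, children = children)
}

# Toy data (illustrative assumption): the class depends only on Windy
toy <- data.frame(
  Outlook = factor(c("sunny", "sunny", "rain", "rain", "overcast", "overcast")),
  Windy   = factor(c("false", "true", "false", "true", "false", "true")),
  Class   = factor(c("Play", "Stay", "Play", "Stay", "Play", "Stay"))
)

tree <- build_tree(toy, target = "Class", attrs = c("Outlook", "Windy"))
print(tree$attribute)
print(sapply(tree$children, function(node) node$class))
```

Because the split is chosen greedily at every node, a tree grown this way on noisy data will keep splitting until the leaves are pure, which is the overfitting behavior that pruning is meant to counter.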

Here, we will use our Purchase Preference dataset to build a C5.0 decision tree model
for product choice prediction based on the ProductChoice response variable.


a. Model Building


model_c50 <- C5.0(train[, c("CustomerPropensity", "LastPurchaseDuration",
                            "MembershipPoints")],
                  train[, "ProductChoice"],
                  control = C5.0Control(CF = 0.001, minCases = 2))


b. Model Summary


summary(model_c50)


Call: 
C5.0.default(x = train[, c("CustomerPropensity", 
"LastPurchaseDuration", "MembershipPoints")], y = 

train[, "ProductChoice"], control = C5.0Control(CF = 0.001, 
minCases = 2)) 


C5.0 [Release 2.07 GPL Edition] Sun Oct 02 16:09:05 2016 


Class specified by attribute `outcome' 
Read 28000 cases (4 attributes) from undefined.data 
Decision tree: 


CustomerPropensity in {High,VeryHigh}:
:...MembershipPoints <= 1: 4 (1264/681)
:   MembershipPoints > 1:
:   :...LastPurchaseDuration <= 6: 3 (3593/2266)
:       LastPurchaseDuration > 6:
:       :...CustomerPropensity = High: 3 (1665/1083)
:           CustomerPropensity = VeryHigh: 4 (2140/1259)
CustomerPropensity in {Low,Medium,Unknown}:
:...MembershipPoints <= 1: 4 (3180/1792)
    MembershipPoints > 1:
    :...CustomerPropensity = Unknown: 1 (8004/4891)
        CustomerPropensity in {Low,Medium}:
        :...LastPurchaseDuration <= 2: 1 (2157/1417)
            LastPurchaseDuration > 2:
            :...LastPurchaseDuration > 13: 2 (1083/773)
                LastPurchaseDuration <= 13:
                :...CustomerPropensity = Medium: 3 (2489/1707)
                    CustomerPropensity = Low:
                    :...MembershipPoints <= 3: 2 (850/583)
                        MembershipPoints > 3: 1 (1575/1124)


Evaluation on training data (28000 cases): 


Decision Tree 


Size Errors 


11 17576(62.8%) << 


(a) (b) (c) (d) <-classified as 
4304 374 1345 1032 (a): class 1 
3374 577 1759 1306 (b): class 2 
2336 484 2691 1394 (c): class 3 
1722 498 1952 2852 (d): class 4 


Attribute usage: 
100.00% CustomerPropensity 


100.00% MembershipPoints 
55.54% LastPurchaseDuration 


Time: 0.1 secs 
plot(model_c50)


Figure 6-38. C5.0 decision tree on the purchase prediction dataset


You can experiment with the parameters of C5.0 to see how the decision tree
changes. As shown in Figure 6-38, if you traverse any path, it forms one decision
rule. For example, the rule CustomerPropensity in {High,VeryHigh} AND MembershipPoints
<= 1 is one path ending in a leaf node, as shown in Figure 6-38.


c. Evaluation

Training set accuracy = 37%

library(gmodels)
purchase_pred_train <- predict(model_c50, train, type = "class")
CrossTable(train$ProductChoice, purchase_pred_train,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  28000

               | predicted default
actual default |     1 |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |  4304 |   374 |  1345 |  1032 |      7055 |
               | 0.154 | 0.013 | 0.048 | 0.037 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |  3374 |   577 |  1759 |  1306 |      7016 |
               | 0.120 | 0.021 | 0.063 | 0.047 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |  2336 |   484 |  2691 |  1394 |      6905 |
               | 0.083 | 0.017 | 0.096 | 0.050 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |  1722 |   498 |  1952 |  2852 |      7024 |
               | 0.061 | 0.018 | 0.070 | 0.102 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total | 11736 |  1933 |  7747 |  6584 |     28000 |
---------------|-------|-------|-------|-------|-----------|


Testing set accuracy = 36%

purchase_pred_test <- predict(model_c50, test)
CrossTable(test$ProductChoice, purchase_pred_test,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  20002

               | predicted default
actual default |     1 |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |  3081 |   279 |   895 |   743 |      4998 |
               | 0.154 | 0.014 | 0.045 | 0.037 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |  2454 |   321 |  1317 |   903 |      4995 |
               | 0.123 | 0.016 | 0.066 | 0.045 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |  1730 |   344 |  1843 |  1118 |      5035 |
               | 0.086 | 0.017 | 0.092 | 0.056 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |  1176 |   349 |  1382 |  2067 |      4974 |
               | 0.059 | 0.017 | 0.069 | 0.103 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total |  8441 |  1293 |  5437 |  4831 |     20002 |
---------------|-------|-------|-------|-------|-----------|


As you can observe, the accuracy is not particularly good. Keep in mind, though, that
with four roughly balanced classes a random guess would be right only about 25% of the
time, so a lower accuracy than in a two-class problem is to be expected. The fact that
the training and testing accuracies are almost equal tells us that the model is not
overfitting.
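
The accuracy figures quoted in this section can be checked directly from a confusion
matrix. The sketch below uses the C5.0 training counts reported earlier; the helper
name accuracy is an assumption made for illustration.

```r
# C5.0 training confusion matrix (rows = actual class, columns = predicted class).
conf_train <- matrix(
  c(4304,  374, 1345, 1032,
    3374,  577, 1759, 1306,
    2336,  484, 2691, 1394,
    1722,  498, 1952, 2852),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = 1:4, predicted = 1:4)
)

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

train_accuracy <- accuracy(conf_train)
round(train_accuracy, 3)   # 0.372
```

The same helper can be applied to any of the cross tables in this chapter once the
counts are placed in a matrix.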


6.7.3.3 Classification and Regression Tree: CART 


CART is a regression tree-based approach and, as explained in Section 6.8.1.1, it
uses the sum of squared deviations about the mean (residual sum of squares) as the node
impurity measure. Keep in mind that CART can also be used for classification problems, in
which case the Gini index is a more appropriate choice of impurity measure. Roughly, here
is a short pseudocode for the algorithm:


1. Start the algorithm at the root node. 


2. For each attribute X, find the subset S that minimizes the
residual sum of squares (RSS) of the two children, and choose
the split that gives the largest decrease in impurity.

3. Check if the relative decrease in impurity is below a prescribed
threshold.

4. If yes, splitting stops; otherwise, repeat Step 2.
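
Step 2 of the pseudocode can be sketched for a single numeric attribute as follows.
This is an illustrative sketch, not the rpart implementation: the function names rss
and best_rss_split and the toy data are assumptions.

```r
# Residual sum of squares of a numeric vector about its mean.
rss <- function(y) sum((y - mean(y))^2)

# Try every midpoint between consecutive distinct values of x and return
# the threshold whose two children have the smallest combined RSS.
best_rss_split <- function(x, y) {
  values <- sort(unique(x))
  if (length(values) < 2) return(NULL)          # nothing to split on
  cuts <- (head(values, -1) + tail(values, -1)) / 2
  child_rss <- sapply(cuts, function(cut) {
    rss(y[x <= cut]) + rss(y[x > cut])
  })
  best <- which.min(child_rss)
  list(threshold  = cuts[best],
       rss_before = rss(y),
       rss_after  = child_rss[best])
}

# Toy data (hypothetical): y jumps when x exceeds 3.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 11, 9, 20, 21, 19)
result <- best_rss_split(x, y)
result$threshold                                   # 3.5

# Relative decrease in impurity, the quantity checked in Step 3.
(result$rss_before - result$rss_after) / result$rss_before
```

A full tree builder would repeat this search over all attributes and recurse into
each child until the relative decrease falls below the chosen threshold.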


Let's see a demonstration of CART using the rpart function from the widely used rpart
package. Ensemble methods built on decision trees, such as Random Forest, are covered
later in this chapter.

The rpart function also accepts a parameter cp (complexity parameter), which signifies
that any split that does not decrease the overall lack of fit by a factor of cp would
not be attempted by the model. The call below does not set cp explicitly, so rpart uses
its default value of 0.01.
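
To get a feel for the effect of cp, the following sketch fits the same tree with a
loose and a strict value. It uses the kyphosis example data bundled with rpart as a
stand-in for the purchase data, and the helper n_leaves is an assumption made for
illustration.

```r
library(rpart)  # recommended package shipped with standard R distributions

# Same model, two complexity parameters: a small cp allows many splits,
# a large cp rejects splits that do not improve the fit enough.
fit_loose  <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class", cp = 0.001)
fit_strict <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class", cp = 0.2)

# Number of terminal nodes (leaves) in a fitted tree.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
n_leaves(fit_loose)
n_leaves(fit_strict)

# The CP table shows the complexity value associated with each split.
printcp(fit_loose)
```

Comparing the two leaf counts shows how a larger cp yields a smaller tree.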


a. Building the model 


library(rpart)
CARTModel <- rpart(ProductChoice ~ IncomeClass + CustomerPropensity +
                     LastPurchaseDuration + MembershipPoints, data = train)

summary(CARTModel)
Call: 
rpart(formula = ProductChoice ~ IncomeClass + CustomerPropensity + 
LastPurchaseDuration + MembershipPoints, data = train) 
n= 28000 


          CP nsplit rel error    xerror        xstd
1 0.09649081      0 1.0000000 1.0034376 0.003456583
2 0.02582955      1 0.9035092 0.9035092 0.003739335
3 0.02143710      2 0.8776796 0.8776796 0.003793749
4 0.01000000      3 0.8562425 0.8562425 0.003833608

Variable importance
  CustomerPropensity     MembershipPoints LastPurchaseDuration          IncomeClass
                  53                   37                    8                    2


Node number 1: 28000 observations, complexity param=0.09649081 
predicted class=1 expected loss=0.7480357 P(node) =1 
class counts: 7055 7016 6905 7024 
probabilities: 0.252 0.251 0.247 0.251 
left son=2 (14368 obs) right son=3 (13632 obs) 
  Primary splits:
      CustomerPropensity   splits as RLRLR, improve=408.0354, (0 missing)
      MembershipPoints     < 1.5 to the right, improve=269.2781, (0 missing)
      LastPurchaseDuration < 5.5 to the left, improve=194.7965, (0 missing)
      IncomeClass          splits as LRLLLLRRRL, improve= 24.2665, (0 missing)
  Surrogate splits:
      LastPurchaseDuration < 5.5 to the left, agree=0.590, adj=0.159, (0 split)
      IncomeClass          splits as LLLLLLLRRR, agree=0.529, adj=0.032, (0 split)
      MembershipPoints     < 9.5 to the left, agree=0.514, adj=0.002, (0 split)

Node number 7: 2066 observations
  predicted class=4  expected loss=0.5382381  P(node) =0.07378571
    class counts:   291   408   413   954
   probabilities: 0.141 0.197 0.200 0.462

library(rpart.plot)
library(rattle)

fancyRpartPlot(CARTModel)


Figure 6-39. CART model


b. Model Evaluation

Training set accuracy = 36%

library(gmodels)
purchase_pred_train <- predict(CARTModel, train, type = "class")
CrossTable(train$ProductChoice, purchase_pred_train,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  28000

               | predicted default
actual default |     1 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-----------|
             1 |  4253 |  1943 |   859 |      7055 |
               | 0.152 | 0.069 | 0.031 |           |
---------------|-------|-------|-------|-----------|
             2 |  3452 |  2629 |   935 |      7016 |
               | 0.123 | 0.094 | 0.033 |           |
---------------|-------|-------|-------|-----------|
             3 |  2384 |  3842 |   679 |      6905 |
               | 0.085 | 0.137 | 0.024 |           |
---------------|-------|-------|-------|-----------|
             4 |  1901 |  3152 |  1971 |      7024 |
               | 0.068 | 0.113 | 0.070 |           |
---------------|-------|-------|-------|-----------|
  Column Total | 11990 | 11566 |  4444 |     28000 |
---------------|-------|-------|-------|-----------|


It looks like a poor model for this dataset. If you observe, the trained model doesn't
predict a single instance of class 2. We will skip the testing set evaluation. You are
encouraged to try the housing dataset used in linear regression to appreciate the CART
algorithm much better.


6.7.3.4 Chi-Square Automated Interaction Detection: CHAID 


In this method, the R code being used for demonstration accepts only nominal or ordinal
categorical predictors. For each predictor variable X, the algorithm works by merging
non-significant categories; each final category of X will result in one child node if the
algorithm chooses X (based on the adjusted p-value) to split the node.

The following algorithm is borrowed from the documentation of the CHAID package
in R and the classic paper by G. V. Kass (1980), called “An Exploratory Technique for
Investigating Large Quantities of Categorical Data.”


Merging: For each predictor variable X, non-significant categories are merged as
follows.

1. If X has one category only, stop and set the adjusted p-value to be 1.

2. If X has two categories, go to Step 8.

3. Else, find the allowable pair of categories of X (an allowable
pair of categories for an ordinal predictor is two adjacent
categories, and for a nominal predictor is any two categories)
that is least significantly different (i.e., the most similar).
The most similar pair is the pair whose test statistic gives the
largest p-value with respect to the dependent variable Y. How
to calculate the p-value under various situations will be described
in later sections.

4. For the pair having the largest p-value, check if its p-value is
larger than a user-specified alpha-level alpha2. If it is, this
pair is merged into a single compound category. Then a new
set of categories of X is formed. If it is not, then go to Step 7.

5. (Optional) If the newly formed compound category consists
of three or more original categories, then find the best binary
split within the compound category, that is, the one whose p-value
is the smallest. Perform this binary split if its p-value is not larger
than an alpha-level alpha3.

6. Go to Step 2.

7. (Optional) Any category having too few observations (as
compared with a user-specified minimum segment size) is
merged with the most similar other category as measured by
the largest of the p-values.

8. The adjusted p-value is computed for the merged categories
by applying Bonferroni adjustments that are to be discussed
later.
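
Steps 3 and 4 of the merging procedure can be sketched with a plain chi-square test
of independence. This is an illustrative sketch for a nominal predictor, not the code
of the CHAID package: the counts matrix and the names cat_pairs, best_pair, and
merge_it are assumptions, and the Bonferroni adjustment of Step 8 is not covered.

```r
# Hypothetical counts: rows are categories of a predictor X,
# columns are classes of the dependent variable Y.
counts <- matrix(
  c(30, 10,
    28, 12,
     5, 35),
  nrow = 3, byrow = TRUE,
  dimnames = list(X = c("a", "b", "c"), Y = c("no", "yes"))
)

# p-value of the chi-square independence test for every pair of categories
# (for a nominal predictor, any two categories form an allowable pair).
cat_pairs <- combn(rownames(counts), 2)
p_values <- apply(cat_pairs, 2, function(pair) {
  chisq.test(counts[pair, ], correct = FALSE)$p.value
})
names(p_values) <- apply(cat_pairs, 2, paste, collapse = "+")

# The most similar pair has the largest p-value; it is merged only if
# that p-value exceeds the merge threshold alpha2.
alpha2 <- 0.05
best_pair <- names(which.max(p_values))
merge_it <- max(p_values) > alpha2
best_pair
merge_it
```

In this toy table, categories a and b have almost the same class distribution, so
they form the most similar pair and would be merged into one compound category.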


Splitting: The best split for each predictor is found in the merging step. The splitting 
step selects which predictor to be used to best split the node. Selection is accomplished 
by comparing the adjusted p-value associated with each predictor. The adjusted p-value 
is obtained in the merging step. 


1. Select the predictor that has the smallest adjusted p-value
(i.e., most significant).

2. If this adjusted p-value is less than or equal to a user-specified
alpha-level alpha4, split the node using this predictor.
Otherwise, do not split and the node is considered as a
terminal node.


Stopping: The stopping step checks if the tree growing process should be stopped 
according to the following stopping rules. 


1. If a node becomes pure, that is, all cases in a node have
identical values of the dependent variable, the node will not
be split.

2. If all cases in a node have identical values for each predictor,
the node will not be split.

3. If the current tree depth reaches the user-specified maximum
tree depth limit value, the tree growing process will stop.

4. If the size of a node is less than the user-specified minimum
node size value, the node will not be split.

5. If the split of a node results in a child node whose node size
is less than the user-specified minimum child node size
value, child nodes that have too few cases (as compared with
this minimum) will merge with the most similar child node
as measured by the largest of the p-values. However, if the
resulting number of child nodes is 1, the node will not be split.

6. If the tree's height is a positive value and equals the maximum
height, the tree growing process will stop.


Let’s now see a demonstration on our purchase prediction dataset. 


Note  To use the code in this section, follow these steps:

1. Download the .zip or .tar.gz file according to your machine
from https://r-forge.r-project.org/R/?group_id=343.

2. Extract the contents of the compressed file into a folder
named CHAID and place it in the library folder of your R
installation. The folder might look something like
C:\Program Files\R-3.2.2\library.

3. That's it. You are ready to call library(CHAID) inside your
R script.


a. Building the model 


Since CHAID takes all categorical inputs, we are using the attributes 
CustomerPropensity and IncomeClass as predictor variables. 


library("CHAID")
Loading required package: partykit
Loading required package: grid
ctrl <- chaid_control(minsplit = 200, minprob = 0.1)
CHAIDModel <- chaid(ProductChoice ~ CustomerPropensity + IncomeClass,
                    data = train, control = ctrl)
print(CHAIDModel)


Model formula:
ProductChoice ~ CustomerPropensity + IncomeClass

Fitted party:
[1] root
|   [2] CustomerPropensity in High
|   |   [3] IncomeClass in , 1, 2, 3, 9: 2 (n = 169, err = 68.6%)
|   |   [4] IncomeClass in 4: 3 (n = 628, err = 65.0%)
|   |   [5] IncomeClass in 5: 4 (n = 1286, err = 70.2%)
|   |   [6] IncomeClass in 6: 3 (n = 1192, err = 67.0%)
|   |   [7] IncomeClass in 7: 3 (n = 662, err = 63.4%)
|   |   [8] IncomeClass in 8: 4 (n = 222, err = 59.9%)
|   [9] CustomerPropensity in Low: 2 (n = 4778, err = 72.0%)
|   [10] CustomerPropensity in Medium
|   |   [11] IncomeClass in , 1, 2, 3, 4, 5, 7: 3 (n = 3349, err = 73.5%)
|   |   [12] IncomeClass in 6, 8: 4 (n = 1585, err = 71.0%)
|   |   [13] IncomeClass in 9: 3 (n = 36, err = 44.4%)
|   [14] CustomerPropensity in Unknown
|   |   [15] IncomeClass in : 2 (n = 18, err = 0.0%)
|   |   [16] IncomeClass in 1: 4 (n = 15, err = 53.3%)
|   |   [17] IncomeClass in 2, 3, 4, 5, 6, 7, 8: 1 (n = 9524, err = 63.6%)
|   |   [18] IncomeClass in 9: 1 (n = 33, err = 39.4%)
|   [19] CustomerPropensity in VeryHigh
|   |   [20] IncomeClass in , 1, 3, 4, 5, 6, 9: 3 (n = 3484, err = 64.5%)
|   |   [21] IncomeClass in 2, 8: 4 (n = 268, err = 48.5%)
|   |   [22] IncomeClass in 7: 4 (n = 751, err = 58.2%)

Number of inner nodes:    5
Number of terminal nodes: 17
#plot(CHAIDModel)


b. Model Evaluation

The accuracy shows no major improvement compared to C5.0 or ID3. However, it's
interesting to see that the accuracy is close to that of the C5.0 algorithm in spite of
using only two attributes.

Training set accuracy: 32%

library(gmodels)
purchase_pred_train <- predict(CHAIDModel, train)
CrossTable(train$ProductChoice, purchase_pred_train,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  28000

               | predicted default
actual default |     1 |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |  3487 |  1367 |  1610 |   591 |      7055 |
               | 0.125 | 0.049 | 0.058 | 0.021 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |  2617 |  1410 |  2047 |   942 |      7016 |
               | 0.093 | 0.050 | 0.073 | 0.034 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |  1669 |  1031 |  3001 |  1204 |      6905 |
               | 0.060 | 0.037 | 0.107 | 0.043 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |  1784 |  1157 |  2693 |  1390 |      7024 |
               | 0.064 | 0.041 | 0.096 | 0.050 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total |  9557 |  4965 |  9351 |  4127 |     28000 |
---------------|-------|-------|-------|-------|-----------|


Testing set accuracy: 32%

purchase_pred_test <- predict(CHAIDModel, test)
CrossTable(test$ProductChoice, purchase_pred_test,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  20002

               | predicted default
actual default |   ... |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |   ... |  1003 |  1048 |   454 |      4998 |
               |   ... | 0.050 | 0.052 | 0.023 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |   ... |   901 |  1502 |   663 |      4995 |
               |   ... | 0.045 | 0.075 | 0.033 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |   ... |   747 |  2034 |   991 |      5035 |
               |   ... | 0.037 | 0.102 | 0.050 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |   ... |   776 |  2008 |   872 |      4974 |
               |   ... | 0.039 | 0.100 | 0.044 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total |   ... |  3427 |  6592 |  2980 |     20002 |
---------------|-------|-------|-------|-------|-----------|


Figure 6-40. CHAID decision tree




With 37% and 36% training and test set accuracy, respectively, the C5.0 algorithm
seems to have done the best among all the others. Although this accuracy might
not be sufficient for any practical application, this example gives a sufficient
understanding of how the decision tree algorithms work. We encourage you to create
a subset of the given dataset with only two classes and then see how each of these
algorithms performs.

6.7.4 Ensemble Trees

Ensemble models in machine learning are a great way to improve your model accuracy.
The best Kaggle competition-winning ML algorithms predominantly use the ensemble
approach. The idea is simple: instead of training one model on a set of observations,
we combine the power of multiple models (or multiple iterations of the same model on
different subsets of the training data) trained on the same set of observations.
Chapter 8 takes a detailed approach to improving model performance using ensembles;
however, we will keep our focus on ensemble models based on decision trees in this
section.


6.7.4.1 Boosting 


Boosting is an ensemble meta-algorithm in ML that helps in reducing bias and variance.
It fits a sequence of weak learners on differently weighted training observations (more
on this in Chapter 8).

We will demonstrate this technique using the C5.0 ensemble model, which is an
extension of what we discussed earlier in this chapter: adding the parameter trials = 10
to the C5.0 function call tells it to perform 10 boosting iterations in the model
building process.
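
As a hedged sketch of what such a call looks like, the following example boosts a
C5.0 tree on the built-in iris data, which stands in for the purchase data here. It
assumes the C50 package is installed; the object names boosted, iris_train, and
iris_test are illustrative.

```r
# Runs only if the C50 package is available; iris is a stand-in dataset.
if (requireNamespace("C50", quietly = TRUE)) {
  library(C50)
  set.seed(100)
  train_idx  <- sample(nrow(iris), 100)
  iris_train <- iris[train_idx, ]
  iris_test  <- iris[-train_idx, ]

  # trials = 10 requests up to 10 boosting iterations.
  boosted <- C5.0(Species ~ ., data = iris_train, trials = 10)

  # Predicted classes for the held-out rows and the resulting accuracy.
  pred <- predict(boosted, iris_test)
  mean(pred == iris_test$Species)
}
```

Setting trials = 1 gives a single, unboosted tree, which makes it easy to compare the
two on the same split.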


library(gmodels)
purchase_pred_train <- predict(ModelC50_boostcv10, train)
CrossTable(train$ProductChoice, purchase_pred_train,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  28000

               | predicted default
actual default |     1 |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |  3835 |   903 |  1117 |  1200 |      7055 |
               | 0.137 | 0.032 | 0.040 | 0.043 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |  2622 |  1438 |  1409 |  1547 |      7016 |
               | 0.094 | 0.051 | 0.050 | 0.055 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |  1819 |   812 |  2677 |  1597 |      6905 |
               | 0.065 | 0.029 | 0.096 | 0.057 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |  1387 |   686 |  1577 |  3374 |      7024 |
               | 0.050 | 0.024 | 0.056 | 0.120 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total |  9663 |  3839 |  6780 |  7718 |     28000 |
---------------|-------|-------|-------|-------|-----------|


Testing set accuracy:

purchase_pred_test <- predict(ModelC50_boostcv10, test)
CrossTable(test$ProductChoice, purchase_pred_test,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  20002

               | predicted default
actual default |     1 |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |  2556 |   769 |   770 |   903 |      4998 |
               | 0.128 | 0.038 | 0.038 | 0.045 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |  2022 |   748 |  1108 |  1117 |      4995 |
               | 0.101 | 0.037 | 0.055 | 0.056 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |  1406 |   701 |  1540 |  1388 |      5035 |
               | 0.070 | 0.035 | 0.077 | 0.069 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |   970 |   548 |  1201 |  2255 |      4974 |
               | 0.048 | 0.027 | 0.060 | 0.113 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total |  6954 |  2766 |  4619 |  5663 |     20002 |
---------------|-------|-------|-------|-------|-----------|


Though the training set accuracy increases to 40%, the testing set accuracy has come
down to about 35%, indicating slight overfitting in this model.


6.7.4.2 Bagging 


This is another class of ML meta-algorithm, also known as bootstrap aggregating. It again
helps in reducing the variance and overfitting on training observations. The process of
bagging is explained next:

1. Given a set of N training observations, generate m new
training sets D, each of size n (where n << N), by uniform
sampling with replacement. This sampling is called bootstrap
sampling. (Refer to Chapter 3 for more details.)

2. Using these m training sets, m models are fitted and the
outputs are combined either by averaging the output
(for regression) or by majority voting (for classification).
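
The two steps above can be sketched with plain rpart trees. This is a minimal sketch,
not the caret treebag implementation: it uses the built-in iris data as a stand-in,
draws bootstrap samples of the same size as the training set (a common choice), and
the names m, trees, votes, and bagged_pred are assumptions.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(100)
# iris is a stand-in dataset; hold out 50 rows for testing.
test_idx   <- sample(nrow(iris), 50)
iris_train <- iris[-test_idx, ]
iris_test  <- iris[test_idx, ]

m <- 25  # number of bootstrap training sets (hypothetical choice)

# Step 1: draw m bootstrap samples (uniform, with replacement) and
# fit one classification tree on each of them.
trees <- lapply(seq_len(m), function(i) {
  boot_rows <- sample(nrow(iris_train), replace = TRUE)
  rpart(Species ~ ., data = iris_train[boot_rows, ], method = "class")
})

# Step 2: every tree votes on every test row; the majority class wins.
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, iris_test, type = "class"))
})
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == iris_test$Species)   # test-set accuracy of the ensemble
```

For a regression problem, the vote in Step 2 would be replaced by the mean of the
individual tree predictions.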


Certain versions of the algorithm, like Random Forest, also sample a smaller set of
attributes than the original dataset contains. Let's use the Bagging CART and
Random Forest algorithms here to demonstrate.

Note  The following code might take a significant amount of time and memory. If
you want, you can reduce the size of the training set for quicker execution.


6.7.4.2.1 Bagging CART 


library(caret)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# Bagged CART
set.seed(100)
CARTBagModel <- train(ProductChoice ~ CustomerPropensity + LastPurchaseDuration +
                        MembershipPoints, data = train, method = "treebag",
                      trControl = control)
Loading required package: ipred
Loading required package: plyr
Warning: package 'plyr' was built under R version 3.2.5
Loading required package: e1071
Warning: package 'e1071' was built under R version 3.2.5


Training set accuracy = 42%
Testing set accuracy = 34%

Though the training set accuracy increases to 42%, the testing set accuracy has come
down, indicating slight overfitting in this model.

Training set accuracy:

library(gmodels)
purchase_pred_train <- predict(CARTBagModel, train)
CrossTable(train$ProductChoice, purchase_pred_train,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

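   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  28000

               | predicted default
actual default |     1 |   ... |
---------------|-------|-------|
             1 |  3761 |   ... |
               | 0.134 |   ... |
---------------|-------|-------|
             2 |  2358 |   ... |
               | 0.084 |   ... |
---------------|-------|-------|
             3 |  1685 |   ... |
               | 0.060 |   ... |
---------------|-------|-------|
             4 |  1342 |   ... |
               | 0.048 |   ... |
---------------|-------|-------|
  Column Total |  9146 |   ... |
---------------|-------|-------|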


Testing set accuracy:

purchase_pred_test <- predict(CARTBagModel, test)
CrossTable(test$ProductChoice, purchase_pred_test,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  20002

               | predicted default
actual default |     1 |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |  2331 |  1071 |   657 |   939 |      4998 |
               | 0.117 | 0.054 | 0.033 | 0.047 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |  1844 |  1012 |   928 |  1211 |      4995 |
               | 0.092 | 0.051 | 0.046 | 0.061 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |  1348 |   875 |  1320 |  1492 |      5035 |
               | 0.067 | 0.044 | 0.066 | 0.075 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |   979 |   759 |  1066 |  2170 |      4974 |
               | 0.049 | 0.038 | 0.053 | 0.108 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total |  6502 |  3717 |  3971 |  5812 |     20002 |
---------------|-------|-------|-------|-------|-----------|




6.7.4.2.2 Random Forest 


Random Forest is one of the most popular decision tree-based ensemble models. The
accuracy of these models tends to be higher than that of most other decision trees. Here
is a broad summary of how the Random Forest algorithm works:

1. Let N = number of observations, n = number of decision trees
(user input), and M = number of variables in the dataset.

2. Choose a subset of m variables from M, where m << M, and
build n decision trees, each using a random set of m variables.

3. Grow each tree as large as possible.

4. Use majority voting to decide the class of the observation.
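
The quantities n and m in this summary correspond to the ntree and mtry arguments of
the randomForest function. The following is a minimal sketch on the built-in iris
data (a stand-in for the purchase data) and assumes the randomForest package is
installed; note that randomForest draws the m candidate variables afresh at every
split rather than once per tree.

```r
# Runs only if the randomForest package is available; iris is a stand-in.
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  set.seed(100)
  rf_fit <- randomForest(Species ~ ., data = iris,
                         ntree = 200,  # n: number of decision trees
                         mtry  = 2)    # m: variables tried at each split
  # Confusion matrix based on out-of-bag predictions,
  # with the per-class error in the last column.
  rf_fit$confusion
}
```

The out-of-bag estimate comes from the observations left out of each tree's sample,
so it gives a rough error estimate without a separate test set.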


Each decision tree is built on a random sample of the N observations. With the usual
bootstrap sampling (drawing N observations with replacement), about two-thirds of the
distinct observations end up in each sample. Now let's use our purchase preferences
dataset again for demonstration using R.

# Random Forest
set.seed(100)

rfModel <- train(ProductChoice ~ CustomerPropensity + LastPurchaseDuration +
                   MembershipPoints, data = train, method = "rf",
                 trControl = control)
Loading required package: randomForest
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin


Training set accuracy = 41%
Testing set accuracy = 36%

This model gives one of the best combinations of training and testing accuracy among
the models we have tried so far. Observe that Random Forest has done slightly better
than Bagging CART in tackling the overfitting problem.

Training set accuracy:

library(gmodels)
purchase_pred_train <- predict(rfModel, train)
CrossTable(train$ProductChoice, purchase_pred_train,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  28000

               | predicted default
actual default |     1 |     2 |     3 |   ... |
---------------|-------|-------|-------|-------|
             1 |  4174 |   710 |  1162 |   ... |
               | 0.149 | 0.025 | 0.042 |   ... |
---------------|-------|-------|-------|-------|
             2 |  2900 |  1271 |  1507 |   ... |
               | 0.104 | 0.045 | 0.054 |   ... |
---------------|-------|-------|-------|-------|
             3 |  1970 |   701 |  2987 |   ... |
               | 0.070 | 0.025 | 0.107 |   ... |
---------------|-------|-------|-------|-------|
             4 |  1564 |   608 |  1835 |   ... |
               | 0.056 | 0.022 | 0.066 |   ... |
---------------|-------|-------|-------|-------|
  Column Total | 10608 |  3290 |  7491 |   ... |
---------------|-------|-------|-------|-------|


Testing set accuracy:

purchase_pred_test <- predict(rfModel, test)
CrossTable(test$ProductChoice, purchase_pred_test,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  20002

               | predicted default
actual default |     1 |     2 |     3 |     4 | Row Total |
---------------|-------|-------|-------|-------|-----------|
             1 |  2774 |   611 |   845 |   768 |      4998 |
               | 0.139 | 0.031 | 0.042 | 0.038 |           |
---------------|-------|-------|-------|-------|-----------|
             2 |  2210 |   639 |  1194 |   952 |      4995 |
               | 0.110 | 0.032 | 0.060 | 0.048 |           |
---------------|-------|-------|-------|-------|-----------|
             3 |  1531 |   602 |  1774 |  1128 |      5035 |
               | 0.077 | 0.030 | 0.089 | 0.056 |           |
---------------|-------|-------|-------|-------|-----------|
             4 |  1100 |   462 |  1389 |  2023 |      4974 |
               | 0.055 | 0.023 | 0.069 | 0.101 |           |
---------------|-------|-------|-------|-------|-----------|
  Column Total |  7615 |  2314 |  5202 |  4871 |     20002 |
---------------|-------|-------|-------|-------|-----------|


There is another approach called stacking, which we cover in much greater detail in 
Chapter 8. 

Overall, among the decision tree models, the ensemble approach seems to do the best
in predicting the product preferences.


6.7.5 Conclusion 


These supervised learning algorithms are widely adopted in industry and in research.
The underlying design of a decision tree makes it easy to interpret, and the model is
very intuitive to connect with the real-world problem. Approaches like boosting and
bagging have given rise to high-accuracy models based on decision trees. In
particular, Random Forest is now one of the most widely used models for many
classification problems.

We presented a detailed discussion of decision trees, where we started with the very
first decision tree models like ID3 and then went on to present the contemporary Bagging
CART and Random Forest algorithms as well.

In the next section, we will discuss our first probabilistic model in this book.
The Bayesian models are easy to implement and powerful enough to capture a lot of
information from a given set of observations and its class labels.

6.8 The Naive Bayes Method

Naive Bayes is a probabilistic model-based machine learning algorithm whose earliest
use case was text categorization. These methods fall in the broad category of
Bayesian algorithms in machine learning. Applications like document categorization
and spam filters for e-mails were the first few areas where Naive Bayes proved to be really
effective as a classifier. The name of the algorithm is derived from the fact that
it relies on one of the most powerful concepts in probability theory, the Bayes theorem
(also called the Bayes rule or Bayes formula). In the coming sections, we will formally
introduce the background necessary for understanding Naive Bayes and demonstrate its
application to our Purchase Prediction dataset.


6.8.1 Conditional Probability 


Conditional probability plays a significant role in ascertaining the impact of one event on
another. Knowing that another event has occurred could increase or decrease the
probability of the event under study. Recall our Facebook Nearby feature
discussion in Chapter 1, where we computed this probability:

P(Visit Cineplex | Nearby)

In other words, how does the information "your friend is near the cineplex" affect
the probability that you will visit the cineplex?


6.8.2 Bayes Theorem 


The Bayes theorem (or Bayes rule or Bayes formula) defines the conditional probability
between two events as

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where

P(A) is the prior probability,

P(A|B) is the posterior probability, read as the probability of the event A happening
given the event B,

P(B) is the marginal likelihood, and

P(B|A) is the likelihood.

P(B|A)·P(A) could also be thought of as the joint probability, which denotes the
probability of A intersection B; in other words, the probability of both events A and B
happening together.

Rearranging the Bayes theorem, we could write it as

$$P(A \mid B) = \frac{P(B \mid A)}{P(B)}\,P(A)$$

where the term P(B|A)/P(B) signifies the impact event B has on the probability of A
happening.

This will form the core of our Naive Bayes algorithm. So, before we get there, let's
understand briefly the terms discussed in the Bayes theorem.


6.8.3 Prior Probability 


Prior probability, or priors, signifies the certainty of an event occurring before some
evidence is considered. Taking the same Facebook Nearby feature, what's the probability
of your friend visiting the cineplex if you don't know anything about his current location?


6.8.4 Posterior Probability 


The probability of the event A happening conditioned on another event gives us the
posterior probability. So, in the Facebook Nearby feature, how does the probability of
your friend visiting the cineplex change if we know that he is within 1 mile of the
cineplex (defined as nearby)? Such additional evidence is useful in increasing the
certainty of a particular event. We will exploit this fundamental idea in designing the
Naive Bayes algorithm.


6.8.5 Likelihood and Marginal Likelihood 


If we slightly modify Table 1-2 in Chapter 1, a two-way contingency table (also called
a frequency table) for the Facebook Nearby example, we can transform it into a
likelihood table, as shown in Table 6-6, where each cell entry is now a conditional
probability.


Table 6-6. A Likelihood Table 


Nearby | rar JM 
Total | 12/25 | 13/25 | 25

Looking at Table 6-6, it's now easy to compute the marginal likelihood P(Nearby)
as 12/25 = 0.48 and the likelihood P(Nearby | Visit Cineplex) as 10/12 = 0.83. The marginal
likelihood, as you can observe, doesn't depend on the other event.


6.8.6 Naive Bayes Methods 


So, putting all these together, we get the final form of Naive Bayes

$$P(\text{Visit Cineplex} \mid \text{Nearby}) = \frac{P(\text{Nearby} \mid \text{Visit Cineplex})\,P(\text{Visit Cineplex})}{P(\text{Nearby})}$$
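
As a rough worked example of this formula, the sketch below plugs in the two values quoted in the text, P(Nearby) = 12/25 and P(Nearby | Visit Cineplex) = 10/12. The prior P(Visit Cineplex) = 12/25 is an assumption read off the denominator of the likelihood (12 visits among the 25 observations), and the variable names are illustrative only.

```r
# Quantities quoted in the text for the Facebook Nearby example
likelihood <- 10 / 12   # P(Nearby | Visit Cineplex)
marginal   <- 12 / 25   # P(Nearby)

# Assumption: 12 of the 25 observations are cineplex visits,
# inferred from the denominator of the likelihood above
prior <- 12 / 25        # P(Visit Cineplex)

# Bayes theorem: posterior = likelihood * prior / marginal
posterior <- likelihood * prior / marginal
print(round(posterior, 2))
```

Under that assumed prior, the prior and the marginal cancel, so the posterior equals the likelihood, 10/12.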


Further generalizing this, let's suppose we have a dataset, represented by a vector
$x = (x_1, \ldots, x_n)$ with n features, independent of each other (this is a strong assumption in
Naive Bayes, and any dataset violating this property will perform poorly with Naive
Bayes). Then a given observation could be classified with a probability
$p(C_k \mid x_1, \ldots, x_n)$ into any of the K classes $C_k$.

So, now using the Bayes theorem, we could write the conditional probability as

$$p(C_k \mid x) = \frac{p(C_k)\,p(x \mid C_k)}{p(x)}$$

The numerator, $p(C_k)\,p(x \mid C_k)$, which represents the joint probability, could be
expanded using the chain rule. However, we will leave that discussion to some advanced text
on this topic.
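
To make the role of the independence assumption concrete, here is a minimal sketch of the computation for categorical features. Under that assumption the likelihood p(x | C_k) is taken as the product of the per-feature probabilities p(x_i | C_k). The toy data frame and the helper `naive_posterior` are hypothetical; this is not the e1071 implementation, and it applies no smoothing for unseen feature values.

```r
# Hypothetical toy data: two categorical features and a class label
toy <- data.frame(
  nearby  = c("yes", "yes", "yes", "no", "no", "no", "yes", "no"),
  weekend = c("yes", "no", "yes", "yes", "no", "no", "no", "yes"),
  visit   = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Posterior p(C_k | x) for every class, assuming independent features
naive_posterior <- function(data, label, new_obs) {
  classes  <- unique(data[[label]])
  features <- setdiff(names(data), label)
  scores <- sapply(classes, function(k) {
    rows  <- data[data[[label]] == k, ]
    prior <- nrow(rows) / nrow(data)                     # p(C_k)
    lik   <- sapply(features, function(f) mean(rows[[f]] == new_obs[[f]]))
    prior * prod(lik)                                    # p(C_k) * prod_i p(x_i | C_k)
  })
  names(scores) <- classes
  scores / sum(scores)                                   # normalize by p(x)
}

post <- naive_posterior(toy, "visit", list(nearby = "yes", weekend = "yes"))
print(post)
```

For this toy data the unnormalized scores are 0.5 × 0.75 × 0.75 for "yes" and 0.5 × 0.25 × 0.25 for "no", giving a posterior of 0.9 for "yes".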
At this point, we have discussed how the Bayes theorem serves as a powerful way
to model a real-world problem. Further, Naive Bayes can be very
effective if the likelihood tables are precomputed, because a real-time implementation then only
has to do a table lookup followed by some quick computation. Naive Bayes is elegantly simple
yet powerful when its assumptions are taken seriously while modeling the problem.
Now, let's apply this technique to our Purchase Preference dataset and see what
we get.


a. Data Preparation 


library(data.table)
library(splitstackshape)
library(e1071)

str(Data_Purchase)

Classes 'data.table' and 'data.frame': 500000 obs. of 12 variables:
 $ CUSTOMER_ID         : chr "000001" "000002" "000003" "000004" ...
 $ ProductChoice       : int 2 3 2 3 2 3 2 2 2 3 ...
 $ MembershipPoints    : int 6 2 4 2 6 6 5 9 5 3 ...
 $ ModeOfPayment       : chr "MoneyWallet" "CreditCard" "MoneyWallet" "MoneyWallet" ...
 $ ResidentCity        : chr "Madurai" "Kolkata" "Vijayawada" "Meerut" ...
 $ PurchaseTenure      : int 441063313198...
 $ Channel             : chr "Online" "Online" "Online" "Online" ...
 $ IncomeClass         : chr "4" ...
 $ CustomerPropensity  : chr "Medium" "VeryHigh" "Unknown" "Low" ...
 $ CustomerAge         : int 55 75 34 26 38 71 72 27 33 29 ...
 $ MartialStatus       : int 0 0 0 0 1 0 0 0 0 1 ...
 $ LastPurchaseDuration: int 4 1515661054156...
 - attr(*, ".internal.selfref")=<externalptr>
#Check the distribution of data before grouping
table(Data_Purchase$ProductChoice)

     1      2      3      4
106603 199286 143893  50218

#Pulling out only the relevant data to this chapter
Data_Purchase <- Data_Purchase[,.(CUSTOMER_ID,ProductChoice,MembershipPoints,
IncomeClass,CustomerPropensity,LastPurchaseDuration)]

#Delete NA from subset
Data_Purchase <- na.omit(Data_Purchase)
Data_Purchase$CUSTOMER_ID <- as.character(Data_Purchase$CUSTOMER_ID)

#Stratified Sampling
Data_Purchase_Model <- stratified(Data_Purchase, group=c("ProductChoice"),
size=10000, replace=FALSE)

print("The Distribution of equal classes is as below")
[1] "The Distribution of equal classes is as below"
table(Data_Purchase_Model$ProductChoice)

    1     2     3     4
10000 10000 10000 10000

#Convert the target and the categorical predictors to factors
Data_Purchase_Model$ProductChoice <- as.factor(Data_Purchase_Model$ProductChoice)
Data_Purchase_Model$IncomeClass <- as.factor(Data_Purchase_Model$IncomeClass)
Data_Purchase_Model$CustomerPropensity <- as.factor(Data_Purchase_Model$CustomerPropensity)

set.seed(917);

#Note: replace=TRUE samples with replacement, so train contains repeated rows
train <- Data_Purchase_Model[sample(nrow(Data_Purchase_Model), size=nrow(Data_Purchase_Model)*(0.7),
replace=TRUE, prob=NULL),]
train <- as.data.frame(train)

#The test set holds the customers that were never drawn into train
test <- as.data.frame(Data_Purchase_Model[!(Data_Purchase_Model$CUSTOMER_ID
%in% train$CUSTOMER_ID),])

b. Naive Bayes Model

model_naiveBayes <- naiveBayes(train[,c(3,4,5)], train[,2])
model_naiveBayes

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = train[, c(3, 4, 5)], y = train[, 2])

A-priori probabilities:
train[, 2]
        1         2         3         4
0.2519643 0.2505714 0.2466071 0.2508571

Conditional probabilities:
          MembershipPoints
train[, 2]     [,1]     [,2]
         1 4.366832 2.385888
         2 4.212087 2.354063
         3 4.518320 2.391260
         4 3.659596 2.520176

          IncomeClass
train[, 2]                       1           2           3           4
         1 0.000000000 0.001417434 0.001842665 0.033451453 0.171651311
         2 0.002993158 0.001710376 0.002423033 0.032354618 0.173175599
         3 0.000000000 0.001158581 0.001737871 0.029543809 0.166980449
         4 0.000000000 0.001566059 0.001850797 0.019219818 0.151480638
          IncomeClass
train[, 2]           5           6           7           8           9
         1 0.337774628 0.265910702 0.142735648 0.040113395 0.005102764
         2 0.328962372 0.265250855 0.143529076 0.046465222 0.003135690
         3 0.325416365 0.275452571 0.148153512 0.046777697 0.004779146
         4 0.318479499 0.280466970 0.161161731 0.060791572 0.004982916

          CustomerPropensity
train[, 2]       High        Low     Medium    Unknown   VeryHigh
         1 0.09992913 0.17987243 0.16328845 0.49666903 0.06024096
         2 0.14438426 0.18258267 0.17901938 0.36944128 0.12457241
         3 0.18580739 0.13714699 0.19608979 0.25561188 0.22534395
         4 0.17283599 0.14635535 0.18180524 0.26153189 0.23747153


c. Model Evaluation

Training Set Accuracy: 34% (the diagonal of the table below holds 9,601 of the 28,000 observations)

model_naiveBayes_pred <- predict(model_naiveBayes, train)
library(gmodels)

CrossTable(model_naiveBayes_pred, train[,2],
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

Total Observations in Table: 28000

             | actual
   predicted |         1 |         2 |         3 |         4 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
           1 |      4016 |      3077 |      2187 |      2098 |     11378 |
             |     0.353 |     0.270 |     0.192 |     0.184 |     0.406 |
             |     0.569 |     0.439 |     0.317 |     0.299 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           2 |       622 |       702 |       500 |       489 |      2313 |
             |     0.269 |     0.304 |     0.216 |     0.211 |     0.083 |
             |     0.088 |     0.100 |     0.072 |     0.070 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           3 |      1263 |      1635 |      2336 |      1890 |      7124 |
             |     0.177 |     0.230 |     0.328 |     0.265 |     0.254 |
             |     0.179 |     0.233 |     0.338 |     0.269 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           4 |      1154 |      1602 |      1882 |      2547 |      7185 |
             |     0.161 |     0.223 |     0.262 |     0.354 |     0.257 |
             |     0.164 |     0.228 |     0.273 |     0.363 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |      7055 |      7016 |      6905 |      7024 |     28000 |
             |     0.252 |     0.251 |     0.247 |     0.251 |           |
-------------|-----------|-----------|-----------|-----------|-----------|

Testing Set Accuracy: 34%

model_naiveBayes_pred <- predict(model_naiveBayes, test)
library(gmodels)

CrossTable(model_naiveBayes_pred, test[,2],
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

Total Observations in Table: 20002

             | actual
   predicted |         1 |         2 |         3 |         4 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
           1 |      2823 |      2155 |      1537 |      1493 |      8008 |
             |     0.353 |     0.269 |     0.192 |     0.186 |     0.400 |
             |     0.565 |     0.431 |     0.305 |     0.300 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           2 |       496 |       548 |       388 |       407 |      1839 |
             |     0.270 |     0.298 |     0.211 |     0.221 |     0.092 |
             |     0.099 |     0.110 |     0.077 |     0.082 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           3 |       885 |      1164 |      1746 |      1358 |      5153 |
             |     0.172 |     0.226 |     0.339 |     0.264 |     0.258 |
             |     0.177 |     0.233 |     0.347 |     0.273 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           4 |       794 |      1128 |      1364 |      1716 |      5002 |
             |     0.159 |     0.226 |     0.273 |     0.343 |     0.250 |
             |     0.159 |     0.226 |     0.271 |     0.345 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |      4998 |      4995 |      5035 |      4974 |     20002 |
             |     0.250 |     0.250 |     0.252 |     0.249 |           |
-------------|-----------|-----------|-----------|-----------|-----------|

Computed from the diagonals of the two tables, the accuracy is 9601/28000, or about 34.3%,
on the training set and 6833/20002, or about 34.2%, on the testing set. The two are nearly
identical, so these tables give no sign of overfitting; the model is simply a weak classifier
on both sets, only modestly better than the 25% expected from guessing among four
balanced classes.
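
As a cross-check, the accuracy can be recomputed directly from the counts in the two CrossTable outputs: it is the share of observations on the diagonal (predicted class equals actual class). The sketch below copies the cell counts from the tables; the helper name `accuracy` is illustrative.

```r
# Counts copied from the two CrossTable outputs (rows = predicted, columns = actual)
train_cm <- matrix(c(4016, 3077, 2187, 2098,
                      622,  702,  500,  489,
                     1263, 1635, 2336, 1890,
                     1154, 1602, 1882, 2547),
                   nrow = 4, byrow = TRUE)
test_cm <- matrix(c(2823, 2155, 1537, 1493,
                     496,  548,  388,  407,
                     885, 1164, 1746, 1358,
                     794, 1128, 1364, 1716),
                  nrow = 4, byrow = TRUE)

# Accuracy = correctly classified observations (diagonal) / all observations
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

train_acc <- accuracy(train_cm)
test_acc  <- accuracy(test_cm)
print(round(c(train = train_acc, test = test_acc), 3))
```

Both values come out at roughly 0.34 (9601/28000 and 6833/20002).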

6.8.7 Conclusion


The techniques discussed in this section are based on probabilistic models, commonly
known as Bayesian models. These models are easy to interpret and have been quite
popular in applications like spam filtering and text classification. Bayesian models
offer the flexibility to add data incrementally, which allows easy model updating as and when
a new set of observations arrives. For this reason, probabilistic approaches like Bayesian models
have found their application in many real-world use cases, and adapting them to
changing data is easier than with many other models.

In the next section, we will take up the discussion of the unsupervised class of
learning algorithms, which are useful in many practical applications where
labeled data is scarce or not available at all. These algorithms are often regarded as
pattern recognition algorithms and work based on certain similarity or distance-based
methods.


6.9 Cluster Analysis 


Clustering analysis involves grouping a set of objects into meaningful and useful clusters
such that the objects within a cluster are homogeneous and are
heterogeneous to the objects of other clusters. The guiding principle of clustering analysis
remains similar across different algorithms: minimize intragroup variability and
maximize intergroup variability by some metric, e.g., distance connectivity, mean-variance, etc.

Clustering does not refer to a specific algorithm; it's a process of creating groups
based on a similarity measure. Clustering analysis uses unsupervised learning algorithms
to create clusters. Cluster analysis is sometimes presented as part of a broader feature
analysis of data. Some researchers break the feature discovery exercise into three steps:


• Factor analysis: Where we first reduce the dimensionality of data
• Clustering: Where we create clusters within data
• Discriminant analysis: Where we measure how well the data properties are
captured


Clustering analysis will involve all or some of the three steps as standalone
processes. This gives insight into the data distribution, which can be used to make
better data decisions, or the results can be fed into some further algorithm.

In machine learning, we are not always solving for some predefined target variable;
exploratory data mining provides us a lot of information about the data itself. There are a lot
of applications of clustering in industry; here are some of them.

• Marketing: The clustering algorithm can provide useful
insights into how distinct groups exist among customers. It
can help answer questions like: Does this group share some
common demographics? What features should we look for while
creating targeted marketing campaigns?


• Insurance: Identify the features of the group having the highest
average claim cost. Is the cluster made up of a specific set of
people? Is some specific feature driving this high claim cost?


• Seismology: Earthquake epicenters show a cluster around
continental faults. Clustering can help identify clusters of
faults with a higher magnitude of probability than others.


• Government planning: Clustering can help identify clusters
of specific households for social schemes, and you can group
households based on multiple attributes: size, income, type, etc.


• Taxonomy: It's very popular among biologists to use
clustering to create taxonomy trees of groups and subgroups
of similar species.


There are other methods as well to identify and create similar groups, like
Q-analysis, multi-dimensional scaling, and latent class analysis. You are encouraged to
read more about them in any marketing research methods textbook.


6.9.1 Introduction to Clustering 


In this chapter we will be discussing and illustrating some of the common clustering
algorithms. The definition of a cluster is loosely built around the notion of what
measure we use to find the goodness of a cluster. The clustering algorithms largely depend on
the type of "clustering model" we assume for our underlying data.

A clustering model is a notion used to signify what kind of clusters we are trying to
identify. Here are some common cluster models and the popular algorithms built on
them.


• Connectivity models: Distance connectivity between observations
is the measure, e.g., hierarchical clustering.


• Centroid models: Distance from the mean value of each observation/
cluster is the measure, e.g., k-means.


• Distribution models: Significance of the statistical distribution
of variables in the dataset is the measure, e.g., expectation-maximization
algorithms.


• Density models: Density in data space is the measure, e.g.,
DBSCAN models.

Further, clustering can be of two types:

• Hard clustering: Each object belongs to exactly one cluster


• Soft clustering: Each object has some likelihood of belonging to a
different cluster


In the next section we will show R demonstrations of these algorithms to
understand their output.


6.9.2 Clustering Algorithms 


Clustering algorithms cannot differentiate between relevant and irrelevant variables. It is
important for the researcher to carefully choose the variables on which the algorithm
will start identifying patterns/groups in data. This is very important because the clusters
formed can be very dependent on the variables included.

A good clustering algorithm can be evaluated based on two primary objectives:


• High intra-class similarity
• Low inter-class similarity


The other common notion among clustering algorithms is the measure of the quality
of clustering. The similarity measure used and its implementation become important
determinants of clustering quality measures. Another important factor in measuring the
quality of clustering is its ability to discover hidden patterns.

Let's first load the House Pricing dataset and see its data description. For posterior
analysis (after building the model), we have appended the data with HouseNetWorth. This
house net worth is a function of StoreArea (sq. mt) and LawnArea (sq. mt).

Data from market research typically arrives in raw format without any target variable.
The clustering algorithms will show us whether the data can be divided along the
net worth scale.


# Read the House Worth data
Data_House_Worth <- read.csv("Dataset/House Worth Data.csv", header=TRUE);

str(Data_House_Worth)

'data.frame': 316 obs. of 5 variables:
 $ HousePrice   : int 138800 155000 152000 160000 226000 275000 215000 392000 325000 151000 ...
 $ StoreArea    : num 29.9 44 46.2 46.2 48.7 56.4 47.1 56.7 84 49.2 ...
 $ BasementArea : int 75 504 493 510 445 1148 380 945 1572 506 ...
 $ LawnArea     : num 11.22 9.69 10.19 6.82 10.92 ...
 $ HouseNetWorth: Factor w/ 3 levels "High","Low","Medium": 2 3 3 3 3 1 3 1 1 3 ...

#Remove the extra column, as we will not be using it
Data_House_Worth$BasementArea <- NULL

A quick analysis of the scatter plot in Figure 6-41 shows us that there is some relationship
between LawnArea and StoreArea. As this is a small and well-calibrated dataset, we
can see and interpret the clusters by eye (a manual process is also clustering). However, we will
assume that we don't have this information beforehand and let the clustering algorithms tell us
about these clusters.


library(ggplot2)


ggplot(Data_House_Worth, aes(StoreArea, LawnArea, color = HouseNetWorth)) +
geom_point()

Figure 6-41. Scatterplot between StoreArea and LawnArea for each HouseNetWorth group


Let's use this data to illustrate different clustering algorithms.

6.9.2.1 Hierarchical Clustering


Hierarchical clustering is based on the connectivity model of clusters. The steps involved
in the clustering process are:


1. Start with N clusters, where N is the number of elements (i.e., assign
each element to its own cluster). In other words, the distances
(similarities) between the clusters are equal to the distances
(similarities) between the items they contain.


2. Now merge the pair of clusters that are closest to each other (the most
similar clusters); e.g., the first iteration will reduce the
number of clusters to N - 1.


3. Again compute the distances (similarities) and merge the
closest pair.


4. Repeat Steps 2 and 3 until all data
points are in one cluster.


You will now have a dendrogram of the clustering at all levels. Choose how
many clusters you want by cutting the dendrogram at the right level.
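
The merge-and-cut procedure can be seen on a very small example. The sketch below uses hypothetical one-dimensional toy data with two obvious groups; `hclust()`, `dist()`, and `cutree()` are the base R (stats) functions, and complete linkage is the `hclust()` default.

```r
# Hypothetical toy data: two tight groups of points on a line
toy <- matrix(c(1, 1.2, 1.6, 8, 8.5), ncol = 1)

# Agglomerative clustering on Euclidean distances (complete linkage by default)
hc <- hclust(dist(toy))

# N points are merged in N - 1 steps; each row of hc$merge records one merge
print(nrow(hc$merge))

# Cutting the tree at two clusters recovers the two groups
groups <- cutree(hc, k = 2)
print(groups)
```

With five points there are four merges, and cutting at k = 2 puts the first three points in one cluster and the last two in the other.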

In R, we use the hclust() function, which performs hierarchical cluster analysis on a set of
dissimilarities and provides methods for analyzing it. This is part of the stats package.

Another important function used here is dist(); this function computes and returns
the distance matrix, using the specified distance measure to compute the
distances between the rows of a data matrix. By default, it uses Euclidean distance.

Mathematically, in Cartesian coordinates, the Euclidean distance between two points
p = (p1, p2,..., pn) and q = (q1, q2,..., qn) in Euclidean n-space is given by

$$d(p,q) = d(q,p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
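
The Euclidean distance formula translates almost literally into R. The sketch below compares a direct computation with the value returned by `dist()`; the two example points are hypothetical.

```r
# Two hypothetical points in 3-dimensional space
p <- c(1, 2, 3)
q <- c(4, 6, 3)

# Direct translation of the formula: square root of the summed squared differences
d_manual <- sqrt(sum((q - p)^2))

# dist() computes the same quantity between the rows of a matrix
d_dist <- as.numeric(dist(rbind(p, q)))

print(c(manual = d_manual, dist = d_dist))
```

Here the squared differences are 9, 16, and 0, so both computations give a distance of 5.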


Let’s apply the hclust function and create our clusters. 


# apply the hierarchical clustering algorithm
clusters <- hclust(dist(Data_House_Worth[,2:3]))


#Plot the dendrogram
plot(clusters)

Figure 6-42. Cluster dendrogram (hclust with "complete" linkage on dist(Data_House_Worth[, 2:3]))


Now we can see there are a number of possible places where we can cut the tree into clusters.
We will show cross-tabulations with 2, 3, and 4 clusters.


# Create different number of clusters
clusterCut_2 <- cutree(clusters, 2)
#table the clustering distribution with actual networth
table(clusterCut_2, Data_House_Worth$HouseNetWorth)

clusterCut_2 High Low Medium
           1  104 135     51
           2   26   0      0

clusterCut_3 <- cutree(clusters, 3)
#table the clustering distribution with actual networth
table(clusterCut_3, Data_House_Worth$HouseNetWorth)

clusterCut_3 High Low Medium
           1    0 122      1
           2  104  13     50
           3   26   0      0

clusterCut_4 <- cutree(clusters, 4)
#table the clustering distribution with actual networth
table(clusterCut_4, Data_House_Worth$HouseNetWorth)

clusterCut_4 High Low Medium
           1    0 122      1
           2   34   9     50
           3   70   4      0
           4   26   0      0


These three separate tables show how well the clusters are able to capture the feature
of net worth. Let's limit ourselves to three clusters, as we know from additional knowledge
that there are three groups of houses by net worth. In statistical terms, the best number
of clusters can be chosen by using the elbow method and/or semi-partial R-square and the
pseudo-F validity index. More details can be learned from Timm, Neil H., Applied Multivariate
Analysis, Springer, 2002.


ggplot(Data_House_Worth, aes(StoreArea, LawnArea, color = HouseNetWorth)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clusterCut_3) +
scale_color_manual(values = c('black', 'red', 'green'))

Figure 6-43. Cluster plot with LawnArea and StoreArea


You can see that most of our "High", "Medium", and "Low" HouseNetWorth points
overlap with the three clusters created by hclust. In hindsight, if we didn't know
the actual net worth scales, we could have retrieved this information from this cluster
analysis.

In the next section, we apply another clustering algorithm to the same data and see how
the results look.


6.9.2.2 Centroid-Based Clustering 


The following text, borrowed from the original paper, A K-Means Clustering Algorithm
by Hartigan et al. [6], gives the most crisp and precise description of the way the k-means
algorithm works:


The aim of the K-means algorithm is to divide M points in N
dimensions into K clusters so that the within-cluster sum of
squares is minimized. It is not practical to require that the
solution has minimal sum of squares against all partitions,
except when M, N are small and K = 2. We seek instead "local"
optima, solution such that no movement of a point from one
cluster to another will reduce the within-cluster sum of squares.


Here, the within-cluster sum of squares (WCSS) is the sum of the squared distances of each observation
in a cluster to its centroid. More technically, for a set of observations $(x_1, x_2, \ldots, x_n)$ and a set
of k clusters $C = \{C_1, C_2, \ldots, C_k\}$,

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the mean of the points in $C_i$.
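
To connect the WCSS formula to R's output, the sketch below computes it by hand for a fitted k-means model and compares it with the `tot.withinss` value that `kmeans()` reports. The toy data is hypothetical (randomly generated), so no particular numeric value is implied.

```r
# Hypothetical toy data: two well-separated groups in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 5)

# WCSS: for each cluster, sum the squared distances of its points to the cluster mean
wcss <- sum(sapply(unique(km$cluster), function(k) {
  pts <- toy[km$cluster == k, , drop = FALSE]
  mu  <- colMeans(pts)            # centroid of cluster k
  sum(sweep(pts, 2, mu)^2)        # squared Euclidean distances to the centroid
}))

print(c(manual = wcss, kmeans = km$tot.withinss))
```

The two numbers should agree up to floating-point error, since `tot.withinss` is the sum of the per-cluster `withinss` values.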
Algorithm

In its simplest form, the algorithm has two steps:


• Assignment: Assign each observation to the cluster that gives the
minimum within-cluster sum of squares (WCSS), i.e., the cluster with the nearest centroid.


• Update: Update each centroid by taking the mean of all the
observations in the cluster.


These two steps are executed iteratively until the assignments in two consecutive
iterations don't change, meaning that a local optimum has been reached (the global optimum
is not guaranteed).
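
The two-step loop can be written out in a few lines of R. The following is a minimal sketch of this simple form of the algorithm, not the procedure used inside R's `kmeans()`; the function name `simple_kmeans` and the toy data are hypothetical, and the initial centroids are simply k randomly chosen observations.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  cluster <- rep(0, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: nearest centroid by squared Euclidean distance
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break            # assignments unchanged: stop
    cluster <- new_cluster
    # Update step: each centroid becomes the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Hypothetical toy data: five points near (0.5, 0.5) and five near (10.5, 10.5)
toy <- rbind(cbind(c(0, 0.5, 1, 0.2, 0.8), c(0, 1, 0.5, 0.7, 0.1)),
             cbind(c(10, 10.5, 11, 10.2, 10.8), c(10, 11, 10.5, 10.7, 10.1)))
set.seed(917)
res <- simple_kmeans(toy, k = 2)
print(res$cluster)
```

On data this well separated the loop settles on the two groups after a few iterations; on harder data the result depends on the random starting centroids, which is why `kmeans()` offers the `nstart` argument.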


Note For interested readers, Hartigan et al., in their original paper, describe a seven-step
procedure.

Let's use our HouseNetWorth data to show k-means clustering. Unlike the
hierarchical clustering, to find the optimal value for k (the number of clusters) here, we will use
an elbow curve. The curve shows the within-groups sum of squares as a function of
the number of clusters.


# Elbow Curve
wss <- (nrow(Data_House_Worth)-1)*sum(apply(Data_House_Worth[,2:3],2,var))

for (i in 2:15) {
wss[i] <- sum(kmeans(Data_House_Worth[,2:3],centers=i)$withinss)
}

plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

Figure 6-44. Elbow curve for varying values of k (number of clusters) on the x axis


The elbow curve suggests that with three clusters, we are able to explain most
of the variance in the data. Beyond four clusters, adding more clusters does not help
explain the groups (as the plot in Figure 6-44 shows, WCSS saturates after three).
Hence, we will once again choose k = 3 clusters.


set.seed(917)


#Run k-means cluster of the dataset
Cluster_kmean <- kmeans(Data_House_Worth[,2:3], 3, nstart = 20)

#Tabulate the cross distribution
table(Cluster_kmean$cluster, Data_House_Worth$HouseNetWorth)

    High Low Medium
  1   84   0      0
  2   46  13     50
  3    0 122      1


The table shows that cluster 1 contains only "High" worth houses, cluster 2 is a mix of
all three levels, and cluster 3 represents the "Low" worth houses except for one point. Here is the
plot of clusters against the actual net worth.


Cluster_kmean$cluster <- factor(Cluster_kmean$cluster)
ggplot(Data_House_Worth, aes(StoreArea, LawnArea, color = HouseNetWorth)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col = Cluster_kmean$cluster) +
scale_color_manual(values = c('black', 'red', 'green'))

Figure 6-45. Cluster plot using k-means


In Figure 6-45, we can see that k-means has captured the clusters very well.

6.9.2.3 Distribution-Based Clustering


Distribution methods are iterative methods that fit a dataset into clusters by
optimizing the parameters of a statistical distribution for each cluster. Here we use the
Gaussian distribution, which is nothing but the normal distribution. This method works in three steps:


1. First randomly choose the Gaussian parameters and fit them to the set of
data points.


2. Iteratively optimize the distribution parameters to fit as many
points as possible.


3. Once it converges to a local optimum, assign each data
point to the cluster whose distribution it is closest to.
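
The three steps above can be sketched for the simplest case, a mixture of two one-dimensional Gaussians fitted with the expectation-maximization (EM) iteration. This is an illustrative base R sketch under stated assumptions, not the EMCluster implementation: the data is hypothetical, the starting values are chosen deterministically rather than at random, and a fixed number of iterations replaces a convergence check.

```r
# Hypothetical 1-D data drawn from two Gaussians
set.seed(42)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 6, sd = 1))

# Step 1: initial guesses for mixing weights, means, and standard deviations
w  <- c(0.5, 0.5)
mu <- c(min(x), max(x))
s  <- c(sd(x), sd(x))

# Step 2: iterate the E-step and M-step
for (iter in 1:200) {
  # E-step: responsibility of each component for each point
  dens <- cbind(w[1] * dnorm(x, mu[1], s[1]), w[2] * dnorm(x, mu[2], s[2]))
  resp <- dens / rowSums(dens)
  # M-step: re-estimate the parameters from the responsibility-weighted points
  nk <- colSums(resp)
  w  <- nk / length(x)
  mu <- colSums(resp * x) / nk
  s  <- sqrt(colSums(resp * (x - rep(mu, each = length(x)))^2) / nk)
}

# Step 3: assign each point to the component with the highest responsibility
cluster <- max.col(resp, ties.method = "first")
print(round(mu, 2))
print(table(cluster))
```

The estimated means should land close to the true values 0 and 6 used to generate the data, and the hard assignment in the last step is what turns the soft responsibilities into cluster labels.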


Although this algorithm creates complex models, it does capture correlation and
dependence among the attributes. The downside is that these methods usually suffer
from overfitting. Here, we show an example of the algorithm on our house worth
data.


library(EMCluster, quietly = TRUE)


ret <- init.EM(Data_House_Worth[,2:3], nclass = 3)
ret

Method: em.EMRnd.EM
 n = 316, p = 2, nclass = 3, flag = 0, logl = -1871.0336.
nc:
[1]  48 100 168
pi:
[1] 0.2001 0.2508 0.5492

ret.new <- assign.class(Data_House_Worth[,2:3], ret, return.all = FALSE)

#This has assigned a class to each case
str(ret.new)

List of 2
 $ nc   : int [1:3] 48 100 168
 $ class: num [1:316] 1 3 3 1 3 3 3 3 3 3 ...

# Plot results
plotem(ret, Data_House_Worth[,2:3])

Figure 6-46. Clustering plot based on the EM algorithm (n = 316, K = 3)


The low-worth and high-worth houses are captured well by the algorithm, while the medium
ones are far more scattered and are not well represented by a cluster. Now let's see how
the scatter plot looks when clustering with this method.


ggplot(Data_House_Worth, aes(StoreArea, LawnArea, color = HouseNetWorth)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col = ret.new$class) +
scale_color_manual(values = c('black', 'red', 'green'))

Figure 6-47. Cluster plot for the EM algorithm


Again, the plot is good for the high and low classes, but isn't doing very well for the
medium class. There are some cases scattered at high LawnArea values as well, but
comparatively they are still captured better in the high cluster. If you observe, this method isn't
as good as what we saw in the case of hclust or k-means, as there are many more overlaps
of points between two clusters.


6.9.2.4 Density-Based Clustering 


Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering
algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in
1996.


This algorithm works on a parametric approach. The two parameters involved in this
algorithm are:


• ε (eps): The radius of the neighborhood around a data point p.


• minPts: The minimum number of data points we want in a
neighborhood to define a cluster.

Once these parameters are defined, the algorithm divides the data points into three
types:


• Core points: A point p is a core point if at least minPts points are
within distance ε of it (including p), where ε is the maximum radius of the
neighborhood from p.


• Border points: A point q is reachable from p if there is a path p1, ...,
pn with p1 = p and pn = q, where each pi+1 is directly reachable
from pi (all the points on the path must be core points, with the
possible exception of q). A border point is a point that is reachable from a
core point but is not itself a core point.


• Outliers: All points not reachable from any other point are
outliers.


With these definitions in place, the steps in DBSCAN are simple:


1. Pick at random a point that is not assigned to a cluster and
calculate its neighborhood. If the neighborhood contains at least
minPts points, make a cluster around that point; otherwise, mark
it as an outlier.


2. Once you find all the core points, start expanding the clusters to
include border points.


3. Repeat these steps until all the points are either assigned to a
cluster or marked as outliers.
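
The three point types can be identified by hand from a distance matrix. The sketch below is a simplified base R illustration of the definitions, not the dbscan package's algorithm: it labels points as core, border, or outlier but does not form the clusters themselves. The toy data and the parameter values are hypothetical.

```r
# Hypothetical toy data: a dense group, one point on its fringe, one isolated point
toy <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1),
             c(0.4, 0.05), c(5, 5))
eps    <- 0.35
minPts <- 4

# Pairwise Euclidean distances between all points
d <- as.matrix(dist(toy))

# Size of each eps-neighborhood (the point itself is counted, as in the definition)
n_neighbors <- rowSums(d <= eps)

# Core points have at least minPts points in their neighborhood
is_core <- n_neighbors >= minPts

# A non-core point within eps of some core point is a border point;
# a non-core point with no core point nearby is an outlier
near_core  <- apply(d[, is_core, drop = FALSE] <= eps, 1, any)
point_type <- ifelse(is_core, "core", ifelse(near_core, "border", "outlier"))
print(point_type)
```

In this toy set the four tightly packed points are core points, the fringe point at (0.4, 0.05) has only three points in its neighborhood but lies within ε of a core point, so it is a border point, and the point at (5, 5) is an outlier.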


library(dbscan)

Warning: package 'dbscan' was built under R version 3.2.5

cluster_dbscan <- dbscan(Data_House_Worth[,2:3], eps = 0.8, minPts = 10)
cluster_dbscan

DBSCAN clustering for 316 objects.
Parameters: eps = 0.8, minPts = 10
The clustering contains 5 cluster(s) and 226 noise points.

  0   1   2   3   4   5
226  10  25  24  15  16

Available fields: cluster, eps, minPts


#Display the hull plot
hullplot(Data_House_Worth[,2:3], cluster_dbscan$cluster)

Figure 6-48. Plot for convex cluster hulls for the DBSCAN algorithm


The result shows DBSCAN has found five clusters and marked 226 cases as noise/
outliers. The hull plot shows that the five clusters are well separated; we can play around
with the parameters to get more generalized or more specialized clusters.


6.9.3 Internal Evaluation 


When a clustering result is evaluated based on the data that was clustered, it is called 
internal evaluation. These methods usually assign the best score to the algorithm that 
produces clusters with high similarity within a cluster and low similarity between 
clusters. 


6.9.3.1 Dunn Index 


J. Dunn proposed this index in 1974 through his published work titled "Well Separated
Clusters and Optimal Fuzzy Partitions," Journal of Cybernetics [7].

The Dunn index aims to identify dense and well-separated clusters. It is defined as
the ratio of the minimal intercluster distance to the maximal intracluster distance.
For each cluster partition, the Dunn index can be calculated using the following formula:

$$D = \frac{\min_{1 \le i < j \le n} d(i,j)}{\max_{1 \le k \le n} d'(k)}$$

where d(i,j) represents the distance between clusters i and j, d'(k) measures the
intra-cluster distance of cluster k, and n is the number of clusters.
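
The ratio can be computed by hand once a choice is made for the two distances. The sketch below assumes one common choice: d(i,j) is the smallest distance between a point of cluster i and a point of cluster j, and d'(k) is the diameter of cluster k (its largest within-cluster distance). The toy data and labels are hypothetical, and other definitions of d and d' give different values.

```r
# Hypothetical toy data: two groups on a line, with known cluster labels
toy <- matrix(c(0, 1, 2, 10, 11), ncol = 1)
cl  <- c(1, 1, 1, 2, 2)
d   <- as.matrix(dist(toy))

# TRUE where two points share a cluster
same <- outer(cl, cl, "==")

# Numerator: smallest distance between points in different clusters
min_inter <- min(d[!same])

# Denominator: largest distance between points in the same cluster
max_intra <- max(d[same])

dunn_manual <- min_inter / max_intra
print(dunn_manual)
```

Here the closest pair across clusters is 2 and 10 (distance 8) and the widest cluster spans 0 to 2 (diameter 2), so the index is 4.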


library(clValid)

#Showing for hierarchical cluster with clusters = 3
dunn(dist(Data_House_Worth[,2:3]), clusterCut_3)
[1] 0.009965404


The Dunn index takes a value between zero and infinity and should be maximized:
higher values are more desirable. Here the value is very low, suggesting
that this is not a good clustering.


6.9.3.2 Silhouette Coefficient 


The silhouette coefficient contrasts the average distance to elements in the same cluster
with the average distance to elements in other clusters. Objects with a high silhouette
value are considered well clustered; objects with a low value may be outliers.

library(cluster)


#Showing for hierarchical cluster with clusters = 3
sk <- silhouette(clusterCut_3, dist(Data_House_Worth[,2:3]))

plot(sk)

Figure 6-49. Silhouette plot for each cluster (x axis: silhouette width s_i; average silhouette width: 0.63)


The silhouette plot shows how the three clusters behave on silhouette width.
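
The silhouette width of a single observation follows from the definition: with a the mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster, the width is (b - a) / max(a, b). The sketch below computes this in base R for hypothetical toy data; the helper `silhouette_width` is illustrative and assumes every cluster has at least two members.

```r
# Hypothetical toy data: two groups on a line, with known cluster labels
toy <- matrix(c(0, 1, 2, 10, 11), ncol = 1)
cl  <- c(1, 1, 1, 2, 2)
d   <- as.matrix(dist(toy))

silhouette_width <- function(i) {
  own <- cl == cl[i]
  # a: mean distance to the other members of the point's own cluster
  a <- sum(d[i, own]) / (sum(own) - 1)
  # b: smallest mean distance to the members of any other cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]), function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

s <- sapply(seq_along(cl), silhouette_width)
print(round(s, 3))
```

For the last point (value 11), a = 1 and b = 10, giving a width of 0.9; all five widths are above 0.8, which is what well-separated clusters look like on this measure.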


6.9.4 External Evaluation 


External evaluation is similar to evaluation done on test data. The data used for testing
is not used for training the model. The test data is labeled by
experts or against some third-party benchmark. The clustering results on these already labeled
items then provide us the metric for how well the clusters grouped our data. As the metric
depends on external inputs, it is called external evaluation.

The method is simple if we know what the actual clusters look like. In our case we
already know the house worth indicator, hence
we can calculate these evaluation metrics. Because our data was already labeled before
clustering, we can do external evaluation on the same data as we used for
clustering.


6.9.4.1 Rand Measure 


The Rand index is similar to the classification rate in multi-class classification problems.
It measures how many items returned by the cluster and by the expert (labeled)
are common and how many differ. If we assume the expert labels (or external labels) to be
correct, then this measures the correct classification rate. It can be computed using the
following formula [8]:

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

where TP is the number of true positives, TN is the number of true negatives, FP is
the number of false positives, and FN is the number of false negatives.


#Using the result from the EM algorithm 
library(EMCluster) 
clust <- ret.new$class 
orig <- ifelse(Data_House_Worth$HouseNetWorth == "High", 2, 
               ifelse(Data_House_Worth$HouseNetWorth == "Low", 1, 2)) 
RRand(orig, clust) 
   Rand adjRand  Eindex 
 0.7653  0.5321  0.4099 


The Rand index value is high, 0.76, indicating a good clustering fit by our EM 
method. 
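
As an illustration of the formula, the sketch below counts pairs of items in base R. It assumes the usual pair-counting reading of the terms: a pair is a true positive when both labelings put the two items together, a true negative when both keep them apart, and so on. The label vectors and function names are hypothetical; this is not the implementation behind RRand.

```r
# Classify every unordered pair of items by whether the two labelings
# place the pair in the same group.
pair_counts <- function(truth, pred) {
  idx <- combn(length(truth), 2)                  # all unordered pairs
  same_truth <- truth[idx[1, ]] == truth[idx[2, ]]
  same_pred  <- pred[idx[1, ]]  == pred[idx[2, ]]
  c(TP = sum(same_truth & same_pred),
    TN = sum(!same_truth & !same_pred),
    FP = sum(!same_truth & same_pred),
    FN = sum(same_truth & !same_pred))
}

rand_index <- function(truth, pred) {
  p <- pair_counts(truth, pred)
  unname((p["TP"] + p["TN"]) / sum(p))
}

# Hypothetical labels: one item of the first group is put in the wrong cluster
truth <- c(1, 1, 1, 2, 2, 2)
pred  <- c(1, 1, 2, 2, 2, 2)

pair_counts(truth, pred)
rand_index(truth, pred)   # (TP + TN) / number of pairs
```

Because only co-membership is compared, the measure does not depend on how the cluster numbers are assigned: relabeling cluster 1 as 2 and vice versa leaves the value unchanged.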
6.9.4.2 Jaccard Index 


The Jaccard index is similar to the Rand index. It measures the overlap between the 
external labels and the labels generated by the clustering algorithm. Its value varies 
between 0 and 1, with 0 implying no overlap and 1 meaning identical sets. The 
Jaccard index is defined by the following formula: 


$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$$ 


where TP is the number of true positives, FP is the number of false positives, and 
FN is the number of false negatives. 


#Using the result from the EM algorithm 
library(EMCluster) 
clust <- ret.new$class 
orig <- ifelse(Data_House_Worth$HouseNetWorth == "High", 2, 
               ifelse(Data_House_Worth$HouseNetWorth == "Low", 1, 2)) 
Jaccard.Index(orig, clust) 
[1] 0.1024096 


The index value is low, suggesting that only about 10% of the values are common. The 
overlap between the original labels and the clusters is therefore low, which indicates 
poor cluster formation. 
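
The formula can be sketched in base R with the same pair-counting convention used for the Rand index. This is an assumption made for illustration: a package function may define TP, FP, and FN differently (for example, on binary labels), so its documentation should be checked before comparing numbers. The label vectors and the function name jaccard_pairs are hypothetical.

```r
# Jaccard index over pairs of items: TP / (TP + FP + FN).
# True negatives (pairs separated by both labelings) are ignored.
jaccard_pairs <- function(truth, pred) {
  idx <- combn(length(truth), 2)                  # all unordered pairs
  same_truth <- truth[idx[1, ]] == truth[idx[2, ]]
  same_pred  <- pred[idx[1, ]]  == pred[idx[2, ]]
  tp <- sum(same_truth & same_pred)
  fp <- sum(!same_truth & same_pred)
  fn <- sum(same_truth & !same_pred)
  tp / (tp + fp + fn)
}

# Hypothetical labels: one item of the first group is put in the wrong cluster
truth <- c(1, 1, 1, 2, 2, 2)
pred  <- c(1, 1, 2, 2, 2, 2)

jaccard_pairs(truth, pred)   # 4 / (4 + 3 + 2)
```

Since true negatives are left out, the Jaccard index is usually lower than the Rand index on the same labels.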


6.9.5 Conclusion 


These supervised learning algorithms are widely adopted in industry and 
research. The underlying design of the decision tree makes it easy to interpret, and the 
model is very intuitive to connect with the real-world problem. Approaches like boosting 
and bagging have given rise to high-accuracy models based on decision trees. In particular, 
Random Forest is now one of the most widely used models for many classification problems. 

We presented a detailed discussion of decision tree where we started with the very 
first decision tree models like ID3 and went on to present the contemporary bagging 
CART and Random Forest algorithms as well. 

In the next section, we will see association rule mining, which works on transactional 
data and has found application in market basket analysis and led to many powerful 
recommendation algorithms. 


6.10 Association Rule Mining 


Association rule learning is a method for discovering interesting relations between 
variables in large databases. It is intended to identify strong rules discovered in databases 
using some measures of interestingness. Based on the concept of strong rules, Rakesh 
Agrawal et al. introduced association rules for discovering regularities between products 
in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. 
For example, the rule {onions,potatoes} => {burger} found in the sales data of a 
supermarket would indicate that if a customer buys onions and potatoes together, they 
are likely to also buy hamburger meat. A library is another good example, where rule 
mining can play an important role in deciding which books to keep and stock. The paper by 
Hossain and Rashedur, “Library Material Acquisition Based on Association Rule Mining,” 
is a good read on how association rule mining can be applied in real situations. 


6.10.1 Introduction to Association Concepts 


Transactional data is generally a rich source of information for a business. In the 
traditional scheme of things, businesses looked at such data from the perspective of 
reporting and producing dashboards for executives to understand the health of their 
business. In the pioneering research paper “Mining Association Rules between Sets of Items 
in Large Databases,” Agrawal et al. proposed alternative uses for this data: 

• Boosting the sale of a product (item) in a store 

• Impact of discontinuing a product 

• Bundling multiple products together for promotions 

• Better shelf planning in a physical supermarket 

• Customer segmentation based on buying patterns 


Consider the Market Basket data, where 

Item set, I = {bread and cake, baking needs, biscuits, canned fruit, canned 
vegetables, frozen foods, laundry needs, deodorants soap, and jams spreads} 

Database, D = {T1, T2, ..., Tn}, the set of transactions 

Here {bread and cake, baking needs, Jams spreads} is a subset of the item set I and 
{bread and cake, baking needs} => {Jams spreads} is a typical rule. 


The following sections describe some useful measures that can help us iterate 
through the algorithm. 


6.10.1.1 Support 


Support is the proportion of transactions in which an item set appears. We will denote it 
by supp(X), where X is an item set. For example, 


supp( {bread and cake, baking needs} ) =2/5=0.4 and 


supp({bread and cake} ) =3/5=0.6. 




6.10.1.2 Confidence 


While support helps in understanding the strength of an item set, confidence indicates 
the strength of a rule. For example, in the rule 
{bread and cake, baking needs} => {Jams spreads}, confidence is the conditional 
probability of finding the item set {Jams spreads} (RHS) in transactions under the 
condition that these transactions also contain {bread and cake, baking needs} (LHS). 
More technically, confidence is defined as 

$$conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}$$ 

The rule {bread and cake, baking needs} => {Jams spreads} has a confidence of 
0.2/0.4 = 0.5, which means that 50% of the time when a customer buys 
{bread and cake, baking needs}, they buy {Jams spreads} as well. 


6.10.1.3 Lift 


Lift is the ratio of the observed support of a rule to the support that would be expected 
if the LHS and RHS were independent of each other, i.e., if the purchase of one didn't 
depend on the other. So, if lift = 1, the LHS and RHS are independent of each other and 
it doesn't make any sense to have such a rule, whereas a lift > 1 tells us the degree to 
which the two occurrences are dependent on one another. 

More technically, 

$$lift(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) \times supp(Y)}$$ 
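
The three measures can be checked with a few lines of base R. The five transactions below are hypothetical: they were chosen only so that the supports agree with the values quoted in this section (0.4 for {bread and cake, baking needs}, 0.6 for {bread and cake}), and the function names supp, conf, and lift are illustrative rather than taken from any package.

```r
# Hypothetical database of five transactions
transactions <- list(
  c("bread and cake", "baking needs", "jams spreads"),
  c("bread and cake", "baking needs"),
  c("bread and cake", "biscuits"),
  c("jams spreads", "canned fruit"),
  c("frozen foods", "laundry needs")
)

# Support: share of transactions that contain every item of the item set
supp <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

# Confidence and lift of the rule lhs => rhs, following the definitions above
conf <- function(lhs, rhs) supp(c(lhs, rhs)) / supp(lhs)
lift <- function(lhs, rhs) supp(c(lhs, rhs)) / (supp(lhs) * supp(rhs))

lhs <- c("bread and cake", "baking needs")
rhs <- "jams spreads"

supp(lhs)        # support of the LHS item set
conf(lhs, rhs)   # confidence of the rule
lift(lhs, rhs)   # lift of the rule
```

With this database the rule has support 0.2, so the confidence is 0.2/0.4 and the lift is 0.2/(0.4 x 0.4).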


The rule {bread and cake, baking needs} => {Jams spreads} has a lift of 
0.2 / (0.4 × 0.4) = 1.25, which means people who buy {bread and cake, baking needs} are 
nearly 1.25 times more likely to buy {Jams spreads} than the typical customer. 

There are some other measures like conviction, leverage, and collective strength; 
however, these three are widely used in the literature and are sufficient for 
understanding the Apriori algorithm. 

Things might work well for small examples like these; however, in practice, a 
typical database of transactions is very large. Agrawal et al. proposed a simple yet fast 
algorithm to work on large databases: 


1. In the first step, all possible candidate item sets are generated. 

2. Then the rules are formed using these candidate item sets. 

3. The rule with the highest lift is generally the preferred choice. 




Many variations of the Apriori algorithm have since been devised. In its original 
form, the steps involved in generating candidate item sets are as follows: 


1. Determine the support of the one element item sets (a.k.a. 
singletons) and discard the infrequent items/item sets. 


2. Form candidate item sets with two items (both items must be 
frequent), determine their support, and discard the infrequent 
item sets. 


3. Form candidate item sets with three items (all contained pairs 
must be frequent), determine their support, and discard the 
infrequent item sets. 


4. Continue by forming candidate item sets with four, five, and 
so on, items until no candidate item set is frequent. 
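
The level-wise search described in these steps can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the optimized procedure used by the arules package: the function name apriori_sketch, the five transactions, and the threshold of 0.3 are all hypothetical.

```r
# Level-wise search for frequent item sets (Apriori-style sketch).
apriori_sketch <- function(transactions, min_supp) {
  supp <- function(items) {
    mean(sapply(transactions, function(t) all(items %in% t)))
  }
  items <- sort(unique(unlist(transactions)))

  # Step 1: keep the frequent singletons
  level <- Filter(function(s) supp(s) >= min_supp, as.list(items))
  frequent <- list()
  k <- 1
  while (length(level) > 0) {
    frequent <- c(frequent, level)
    k <- k + 1
    pool <- sort(unique(unlist(level)))
    if (length(pool) < k) break
    # Candidates with k items, built only from items that are still frequent
    cands <- combn(pool, k, simplify = FALSE)
    # Prune: every (k-1)-subset of a candidate must itself be frequent
    is_frequent <- function(s) any(sapply(level, function(f) setequal(f, s)))
    cands <- Filter(function(cand) {
      all(sapply(combn(cand, k - 1, simplify = FALSE), is_frequent))
    }, cands)
    # Count support and discard the infrequent candidates
    level <- Filter(function(s) supp(s) >= min_supp, cands)
  }
  frequent
}

# Hypothetical database of five transactions
transactions <- list(
  c("bread and cake", "baking needs", "jams spreads"),
  c("bread and cake", "baking needs"),
  c("bread and cake", "biscuits"),
  c("jams spreads", "canned fruit"),
  c("frozen foods", "laundry needs")
)

freq <- apriori_sketch(transactions, min_supp = 0.3)
freq
```

With these transactions, three singletons and the pair {baking needs, bread and cake} survive; no candidate with three items can be formed, so the search stops.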


6.10.2 Rule-Mining Algorithms 


We will use the Market Basket data to demonstrate this algorithm in R: 

library(arules) 
MarketBasket <- read.transactions("Dataset/MarketBasketProcessed.csv", sep = ",") 
summary(MarketBasket) 
transactions as itemMatrix in sparse format with 
 4601 rows (elements/itemsets/transactions) and 
 100 columns (items) and a density of 0.1728711 

most frequent items: 
bread and cake          fruit     vegetables     milk cream   baking needs        (Other) 
          3330           2962           2961           2939           2795          64551 

element (itemset/transaction) length distribution: 
sizes 
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
 30  15  14  36  44  75  72 111 144 177 227 227 290 277 286 302 239 247 
 19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
193 191 170 199 160 153 108 125  90  94  59  43  45  35  36  27  16  11 
 37  38  39  40  41  42  43  47 
 11   6   4   5   4   1   1   1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   12.00   16.00   17.29   22.00   47.00 
includes extended item information - examples: 


labels 
1 canned vegetables 
2 750ml red imp 
3 750ml red nz 


#Transactions - First two 
inspect (MarketBasket[1:2]) 
items 
1 {baking needs, 
beef, 
biscuits, 
bread and cake, 
canned fruit, 
dairy foods, 
fruit, 
health food other, 
juice sat cord ms, 
lamb, 
puddings deserts, 
sauces gravy pkle, 
small goods, 
stationary, 
vegetables, 
wrapping } 
2 {750ml white nz, 
baby needs, 
baking needs, 
biscuits, 
bread and cake, 
canned vegetables, 
cheese, 
cleaners polishers, 
coffee, 
confectionary, 
dishcloths scour, 
frozen foods, 
fruit, 
juice sat cord ms, 
margarine, 
mens toiletries, 
milk cream, 
party snack foods, 
razor blades, 
sauces gravy pkle, 
small goods, 
tissues paper prd, 
vegetables, 
wrapping } 
# Top 20 frequently bought products 
itemFrequencyPlot(MarketBasket, topN = 20) 

[Figure: item frequency plot of the 20 most frequent items; y-axis: item frequency (relative)] 

[Figure: image of the transaction matrix; x-axis: Items (Columns); y-axis: Transactions (Rows)] 
Figure 6-51. Sparsity visualization in transactions of market basket data 
6.10.2.1 Apriori 


The support and confidence thresholds can be modified to allow for flexibility in this model. 

a. Model Building 


We use the apriori function from the arules package to demonstrate 
market basket analysis with fixed values for support and confidence: 


library(arules) 

MarketBasketRules <- apriori(MarketBasket, 
                             parameter = list(support = 0.2, confidence = 0.8, minlen = 2)) 
MarketBasketRules 

Parameter specification: 
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext 
        0.8    0.1    1 none FALSE            TRUE     0.2      2     10  rules FALSE 

Algorithmic control: 
 filter tree heap memopt load sort verbose 
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE 

writing ... [190 rule(s)] done [0.00s]. 
creating S4 object ... done [0.00s]. 


b. Model Summary 

The top five rules and their lift: 

{beef, fruit} => {vegetables} - 1.291832 
{biscuits, bread and cake, frozen foods, vegetables} => {fruit} - 1.288442 
{biscuits, milk cream, vegetables} => {fruit} - 1.275601 
{bread and cake, fruit, sauces gravy pkle} => {vegetables} - 1.273765 
{biscuits, bread and cake, vegetables} => {fruit} - 1.270252 


summary(MarketBasketRules) 
set of 190 rules 

rule length distribution (lhs + rhs):sizes 
 2  3  4  5 
 4 93 84  9 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   3.000   3.000   3.516   4.000   5.000 

summary of quality measures: 
    support         confidence          lift 
 Min.   :0.2002   Min.   :0.8003   Min.   :1.106 
 1st Qu.:0.2082   1st Qu.:0.8164   1st Qu.:1.138 
 Median :0.2260   Median :0.8299   Median :1.160 
 Mean   :0.2394   Mean   :0.8327   Mean   :1.169 
 3rd Qu.:0.2642   3rd Qu.:0.8479   3rd Qu.:1.187 
 Max.   :0.3980   Max.   :0.8941   Max.   :1.292 

mining info: 
         data ntransactions support confidence 
 MarketBasket          4601     0.2        0.8 

Sorting grocery rules by lift: 


inspect (sort(MarketBasketRules, by ="lift")[1:5]) 


lhs rhs Support confidence lift 
1 {beef, 
fruit} => {vegetables} 0.2143012 0.8313659 1.291832 


2 {biscuits, 
bread and cake, 
frozen foods, 
vegetables} => {fruit} 0.2019126 0.8294643 1.288442 
3 {biscuits, 
milk cream, 
vegetables} => {fruit} 0.2206042 0.8211974 1.275601 
4 {bread and cake, 
fruit, 
sauces gravy pkle} => {vegetables} 0.2184308 0.8197390 1.273765 
5 {biscuits, 
bread and cake, 
vegetables} => {fruit} 0.2642904 0.8177539 1.270252 
# store as data frame 
MarketBasketRules_df <- as(MarketBasketRules, "data.frame") 
str(MarketBasketRules_df) 

'data.frame': 190 obs. of 4 variables: 
 $ rules     : Factor w/ 190 levels "{baking needs,beef} => {bread and cake}",..: 189 106 163 174 60 116 115 118 16 120 ... 
 $ support   : num 0.203 0.226 0.223 0.398 0.201 ... 
 $ confidence: num 0.835 0.811 0.804 0.8 0.815 ... 
 $ lift      : num 1.15 1.12 1.11 1.11 1.13 ... 
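
As a sanity check on the output, the lift of a rule can be recovered from its confidence and the support of its RHS, because dividing the definition of lift by the definition of confidence gives lift(X => Y) = conf(X => Y) / supp(Y). The base R sketch below uses only numbers printed above (the item counts from summary(MarketBasket) and the confidences from inspect); the variable names are illustrative.

```r
# Numbers taken from the printed output above
n_transactions <- 4601
item_counts <- c(fruit = 2962, vegetables = 2961)   # "most frequent items"
supp_rhs <- item_counts / n_transactions            # support of each RHS item

conf_rule1 <- 0.8313659   # {beef, fruit} => {vegetables}
conf_rule2 <- 0.8294643   # {biscuits, bread and cake, frozen foods, vegetables} => {fruit}

# lift = confidence / support of the RHS
lift_rule1 <- unname(conf_rule1 / supp_rhs["vegetables"])
lift_rule2 <- unname(conf_rule2 / supp_rhs["fruit"])

round(c(lift_rule1, lift_rule2), 3)
```

Both values agree with the lift column reported by inspect for the top two rules, up to rounding of the printed confidences.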
6.10.2.2 Eclat 


Eclat is another algorithm for association rule mining. It uses simple 
intersection operations for equivalence class clustering along with bottom-up lattice 
traversal. The following two references discuss the algorithm in detail: 


• Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori 
Ogihara, and Wei Li (1997), “New Algorithms for Fast Discovery 
of Association Rules,” Technical Report 651, Computer Science 
Department, University of Rochester, Rochester, NY 14627. 


• Christian Borgelt (2003), “Efficient Implementations of 
Apriori and Eclat,” Workshop of Frequent Item Set Mining 
Implementations (FIMI), Melbourne, FL, USA. 
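
The intersection idea can be illustrated in base R. The sketch assumes a vertical layout in which every item is stored with the list of transaction ids (a tid list) that contain it, so that the support of an item set is the size of the intersection of its tid lists. It shows only this counting step, not the equivalence class clustering or the lattice traversal of the full algorithm, and the transactions and names are hypothetical.

```r
# Hypothetical database of five transactions
transactions <- list(
  c("bread and cake", "baking needs", "jams spreads"),
  c("bread and cake", "baking needs"),
  c("bread and cake", "biscuits"),
  c("jams spreads", "canned fruit"),
  c("frozen foods", "laundry needs")
)

# Vertical layout: for each item, the ids of the transactions that contain it
items <- sort(unique(unlist(transactions)))
tidlists <- lapply(setNames(items, items), function(it) {
  which(sapply(transactions, function(t) it %in% t))
})

# Support of an item set = size of the intersection of its tid lists / n
supp_tid <- function(itemset) {
  tids <- Reduce(intersect, tidlists[itemset])
  length(tids) / length(transactions)
}

tidlists[["bread and cake"]]                      # transactions 1, 2, and 3
supp_tid(c("bread and cake", "baking needs"))     # transactions 1 and 2
```

Once the tid lists are built, supports are obtained without rescanning the transactions, which is the reason the intersections are described as simple operations.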


a. Model Building 

library(arules) 

With support = 0.2: 

MarketBasketRules_Eclat <- eclat(MarketBasket, parameter = list(support = 0.2, minlen = 2)) 

parameter specification: 
 tidLists support minlen maxlen            target   ext 
    FALSE     0.2      2     10 frequent itemsets FALSE 

algorithmic control: 
 sparse sort verbose 
      7   -2    TRUE 

writing ... [531 set(s)] done [0.00s]. 
Creating S4 object ... done [0.00s]. 

With support = 0.1: 

MarketBasketRules_Eclat <- eclat(MarketBasket, parameter = list(supp = 0.1, maxlen = 15)) 

parameter specification: 
 tidLists support minlen maxlen            target   ext 
    FALSE     0.1      1     15 frequent itemsets FALSE 

algorithmic control: 
 sparse sort verbose 
      7   -2    TRUE 

writing ... [7503 set(s)] done [0.00s]. 
Creating S4 object ... done [0.00s]. 

Observe how the number of frequent item sets increases (from 531 to 7503) when the 
support threshold is decreased. Experiment with the support value to see how the results change. 


b. Model Summary 

The top five item sets and their support: 

{bread and cake, milk cream} 0.5079331 
{bread and cake, fruit} 0.5053249 
{bread and cake, vegetables} 0.4994566 
{fruit, vegetables} 0.4796783 
{baking needs, bread and cake} 0.4762008 

summary(MarketBasketRules_Eclat) # the model with support = 0.2 
set of 531 itemsets 

most frequent items: 
bread and cake     vegetables          fruit   baking needs   frozen foods        (Other) 
           196            137            136            130            122            772 


element (itemset/transaction) length distribution:sizes 
2 3 4 5 
187 260 81 3 


Min. 1st Qu. Median Mean 3rd Qu. Max. 
2.000 2.000 3.000 2.812 3.000 5.000 


summary of quality measures: 


support 
Min. :0.2002 
1st Qu.:0.2118 
Median :0.2378 
Mean :0.2539 
3rd Qu.:0.2768 
Max. :0.5079 


includes transaction ID lists: FALSE 
mining info: 


data ntransactions support 
MarketBasket 4601 0.2 
Sorting the item sets by support: 

inspect(sort(MarketBasketRules_Eclat, by = "support")[1:5]) 
  items              support 
1 {bread and cake, 
   milk cream}       0.5079331 
2 {bread and cake, 
   fruit}            0.5053249 
3 {bread and cake, 
   vegetables}       0.4994566 
4 {fruit, 
   vegetables}       0.4796783 
5 {baking needs, 
   bread and cake}   0.4762008 

# store as data frame 
MarketBasketRules_Eclat_df <- as(MarketBasketRules_Eclat, "data.frame") 
str(MarketBasketRules_Eclat_df) 
'data.frame': 531 obs. of 2 variables: 
 $ items  : Factor w/ 531 levels "{baking needs,beef,bread and cake}",..: 338 250 239 358 357 302 96 334 426 341 ... 
 $ support: num 0.203 0.213 0.226 0.203 0.2 ... 


The results are shown based on the support of the item sets rather than the lift. This 
is because Eclat only mines frequent item sets. There is no output of the lift measure, for 
which Apriori is a more suitable approach. Nevertheless, this output shows the top five 
item sets with the highest support, which could be further used to generate rules. 


6.10.3 Recommendation Algorithms 


In the preceding section, we saw association rule mining, which could also be used 
to generate product recommendations for customers based on their purchase history. 
For n products, each customer is represented by an n-dimensional 0-1 vector, where 1 
means the customer has bought the corresponding product and 0 otherwise. Based on the 
rules with the highest lift, we could recommend the product on the RHS of a rule to all the 
customers who bought the products in its LHS. 

This might work if the sparsity in the data isn't too high; however, there are more 
robust and efficient algorithms, collectively known as recommendation algorithms. 
These algorithms were popularized by Amazon's product and Netflix's movie 
recommendation systems. A significant amount of research has 
been done in this area in the last couple of years. One of the most elegant implementations 
of recommender algorithms can be found in the recommenderlab package in R, 
developed by Michael Hahsler. It is well documented in the CRAN article 
titled “recommenderlab: A Framework for Developing and Testing Recommendation 
Algorithms” (https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf). 
This article not only elaborates on the usage of the 
package but also gives a good introduction to the various recommendation algorithms. 
In this book, we discuss the collaborative filtering-based approach for food 
recommendations using the Amazon Fine Foods Review dataset. The data has one row 
for each user and their rating, from 1 to 5 (lowest to highest), as described earlier in 
the text mining section. Two of the most popular recommendation algorithms, user-based 
and item-based collaborative filtering, are presented in this book. We encourage 
you to refer to the recommenderlab article from CRAN for a more elaborate discussion. 


6.10.3.1 User-Based Collaborative Filtering (UBCF) 


The UBCF algorithm works on the assumption that users with similar preferences will 
rate similarly. For example, if user A likes spaghetti noodles and gives them a rating of 4, 
and user B has similar taste to user A, then B will rate the spaghetti close to 4. This approach 
might not work if you consider only two users; however, if instead we find the three 
users closest to user B and combine their ratings for spaghetti noodles in a collaborative 
manner (which could be as simple as taking an average), we could produce a much more accurate 
estimate of user B's rating for spaghetti noodles. For a given user-product (item) rating matrix, as 
shown in Figure 6-52, the algorithm works as follows: 


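1. We compute the similarity between two users using either 
cosine similarity or Pearson correlation, two of the most 
widely used approaches for comparing two vectors: 

$$sim_{Pearson}(x, y) = \frac{\sum_{i \in I} (x_i - \bar{x})(y_i - \bar{y})}{(|I| - 1)\, sd(x)\, sd(y)}$$ 

and 

$$sim_{Cosine}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$ 

where x and y are the rating vectors of the two users and I is the set of items rated by both. 
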
2. Based on the similarity measure, choose the k nearest 
neighbors of the user to whom the recommendation has to be 
given. 

3. Take an average of the ratings of the k nearest neighbors. 

4. Recommend the top N products based on the rating vector. 

Some additional notes about the previous algorithm: 

• We could normalize the rating matrix to remove any user bias. 

• In this approach we treat each user in the neighborhood equally; 
however, it's possible that some users in the neighborhood are 
more similar to the active user than others. In this case, we could assign 
certain weights to allow for some flexibility. 
Note: NA ratings are treated as 0. 

Figure 6-52. Illustration of UBCF 

As shown in Figure 6-52, for a new user with the rating vector <2.0,NA,2.0,NA 
,9.0,NA,NA,3.0,3.0>, we would like to find the missing ratings. Based on the Pearson 
correlation, for k = 3, we pick the three users that are the nearest neighbors of the new 
user. Take an average of the ratings by these users and you will get 
<2.7,3.3,3.0,1.0,5.0,1.3,2.0,2.7,2.0>. We might recommend products P5 and P2 to the new 
user based on these rating vectors. 
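
The four steps can be sketched in base R on a small rating matrix. Everything in this sketch is hypothetical: the ratings are invented and are not the values from Figure 6-52, the function name ubcf_predict is illustrative, and the similarity is computed only over the items the new user has rated (a simplification; recommenderlab has its own normalization and handling of missing ratings).

```r
# Hypothetical ratings of five users for four products
ratings <- matrix(c(
  5, 4, 1, 2,
  4, 5, 2, 4,
  1, 2, 5, 5,
  5, 4, 2, 5,
  2, 1, 4, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("U", 1:5), paste0("P", 1:4)))

# New user who has not rated P4 yet
new_user <- c(P1 = 5, P2 = 4, P3 = 1, P4 = NA)

ubcf_predict <- function(ratings, new_user, k = 3) {
  rated <- !is.na(new_user)
  # Step 1: Pearson correlation with every user, over the items the new user rated
  sims <- apply(ratings, 1, function(u) cor(u[rated], new_user[rated]))
  # Step 2: the k most similar users
  neighbours <- names(sort(sims, decreasing = TRUE))[seq_len(k)]
  # Step 3: average the neighbours' ratings for every product
  predicted <- colMeans(ratings[neighbours, , drop = FALSE], na.rm = TRUE)
  list(similarity = sims, neighbours = neighbours, predicted = predicted)
}

fit <- ubcf_predict(ratings, new_user, k = 3)
fit$neighbours
# Step 4: rank the products the new user has not rated yet
sort(fit$predicted[is.na(new_user)], decreasing = TRUE)
```

Here the users whose ratings rise and fall with those of the new user (U1, U4, and U2) become the neighbours, and the prediction for P4 is the mean of their ratings for that product.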


6.10.3.2 Item-Based Collaborative Filtering (IBCF) 


IBCF is similar to UBCF, but here items are compared with items based on the 
relationship between items inferred from the rating matrix. A similarity matrix is 
obtained using, again, either cosine similarity or Pearson correlation. 
Since IBCF removes any user bias and the similarity matrix can be precomputed, it's generally 
considered more efficient, but it is known to produce slightly inferior results to UBCF. 
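
The item-to-item comparison can be sketched in base R as follows. The rating matrix is hypothetical and, to keep the example short, has no missing values; scoring an unrated item as a similarity-weighted mean of the user's known ratings is one common choice and is an assumption here, not a description of how recommenderlab implements IBCF.

```r
# Hypothetical ratings of five users for four products (no missing values)
ratings <- matrix(c(
  5, 4, 1, 2,
  4, 5, 2, 4,
  1, 2, 5, 5,
  5, 4, 2, 5,
  2, 1, 4, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("U", 1:5), paste0("P", 1:4)))

# Cosine similarity between every pair of item columns; this matrix
# depends only on the rating matrix and can be computed ahead of time
norms <- sqrt(colSums(ratings^2))
item_sim <- crossprod(ratings) / outer(norms, norms)
round(item_sim, 2)

# Score an unrated item for a new user as the similarity-weighted
# mean of the ratings the user has already given
new_user <- c(P1 = 5, P2 = 4, P3 = 1, P4 = NA)
rated <- names(new_user)[!is.na(new_user)]
w <- item_sim["P4", rated]
score_P4 <- sum(w * new_user[rated]) / sum(w)
score_P4
```

Because item_sim is computed once, scoring a new user needs only that user's own ratings, which is where the efficiency advantage mentioned above comes from.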
Let's now use the Amazon Fine Food Review data and apply the UBCF and IBCF algorithms 
from the recommenderlab package in R. 
a. Loading Data 

library(data.table) 

fine_food_data <- read.csv("Food Reviews.csv", stringsAsFactors = FALSE) 
fine_food_data$Score <- as.factor(fine_food_data$Score) 

str(fine_food_data[-10]) 

'data.frame': 35173 obs. of 9 variables: 
 $ Id                    : int 1 2 3 4 5 6 7 8 9 10 ... 
 $ ProductId             : chr "B001E4KFG0" "B00813GRG4" "B000LQOCH0" "B000UA0QIQ" ... 
 $ UserId                : chr "A3SGXH7AUHU8GW" "A1D87F6ZCVE5NK" "ABXLMWJIXXAIN" "A395BORC6FGVXV" ... 
 $ ProfileName           : chr "delmartian" "dll pa" "Natalia Corres \"Natalia Corres\"" "Karl" ... 
 $ HelpfulnessNumerator  : int 1 0 1 3 0 0 0 0 1 0 ... 
 $ HelpfulnessDenominator: int 1 0 1 3 0 0 0 0 1 0 ... 
 $ Score                 : Factor w/ 5 levels "1","2","3","4",..: 5 1 4 2 5 4 5 5 5 5 ... 
 $ Time                  : int 1303862400 1346976000 1219017600 1307923200 1350777600 1342051200 1340150400 1336003200 1322006400 1351209600 ... 
 $ Summary               : chr "Good Quality Dog Food" "Not as Advertised" "\"Delight\" says it all" "Cough Medicine" ... 


b. Data Preparation 

library(caTools) 

# Randomly split the data and use only 5% of the dataset 
set.seed(90) 
split = sample.split(fine_food_data$Score, SplitRatio = 0.05) 
fine_food_data = subset(fine_food_data, split == TRUE) 

select_col <- c("UserId", "ProductId", "Score") 
fine_food_data_selected <- fine_food_data[, select_col] 
rownames(fine_food_data_selected) <- NULL 
fine_food_data_selected$Score = as.numeric(fine_food_data_selected$Score) 

#Remove Duplicates 
fine_food_data_selected <- unique(fine_food_data_selected) 
c. Creating the Rating Matrix 

We will use a function called dcast to create the rating matrix from the review ratings 
in the Amazon Fine Food Review dataset: 

library(recommenderlab) 

#RatingsMatrix 
RatingMat <- dcast(fine_food_data_selected, UserId ~ ProductId, value.var = "Score") 
User = RatingMat[, 1] 
Product = colnames(RatingMat)[2:ncol(RatingMat)] 
RatingMat[, 1] <- NULL 
RatingMat <- as.matrix(RatingMat) 
dimnames(RatingMat) = list(user = User, product = Product) 

realM <- as(RatingMat, "realRatingMatrix") 

d. Exploring the Rating Matrix 

#distribution of ratings 
hist(getRatings(realM), breaks = 15, main = "Distribution of Ratings", 
     xlab = "Ratings", col = "grey") 
Figure 6-53. Distribution of ratings 

#Sparse Matrix Representation 
head(as(realM, "data.frame")) 
user item rating 

467 A10012K7DF3SBQ BOOOSATIG4 
1381 A10080F3B083XV BOO4BKP68Q 
428 A1031BS8KG7I02 BOOOPDY3PO0 
1396 A1074ZS6AOHICU BOO4JOTAKW 
951 A107M01RZUQ8V BOO1RVFDOO 
1520 A108GQ9A91JIP4 B005K401VI 1 

#The realRatingMatrix can be coerced back into a matrix which is identical to the original matrix 
identical(as(realM, "matrix"), RatingMat) 
[1] TRUE 

#Sparsity in Rating Matrix 
image(realM, main = "Raw Ratings") 
Figure 6-54. Raw ratings by users 

e. UBCF Recommendation Model 

#UBCF Model 
r_UBCF <- Recommender(realM[1:1700], method = "UBCF") 
r_UBCF 
Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 1700 users. 

#List of objects in the model output 
names(getModel(r_UBCF)) 
[1] "description" "data"        "method"      "nn"          "sample" 
[6] "normalize"   "verbose" 

#Recommend products for the remaining users (rows 1700 to 1729) 
recom_UBCF <- predict(r_UBCF, realM[1700:1729], n = 5) 
recom_UBCF 
Recommendations as 'topNList' with n = 5 for 30 users. 

#Display the recommendation 
reco <- as(recom_UBCF, "list") 
reco[lapply(reco, length) > 0] 
$AY6MB5S44GMH4 
[1] "B000084EK5" "B000084EKL" "B000084ETV" "B00008DF91" "B00008IOLO" 

$AYOMAHLWROHUG 
[1] "B000084EK5" "B000084EKL" "B000084ETV" "B00008DF91" "B00008IOLO" 

$AYX86RC7QV2UT 
[1] "B000084EK5" "B000084EKL" "B000084ETV" "B00008DF91" "B00008IOLO" 

$AZ41FJO1WKBTB 
[1] "B000084EK5" "B000084EKL" "B000084ETV" "B00008DF91" "B00008IOLO" 

Similarly, you can extract such recommendations for IBCF as well. 


f. Evaluation 

set.seed(2016) 
scheme <- evaluationScheme(realM[1:1700], method = "split", train = .9, 
                           k = 1, given = 1, goodRating = 3) 
scheme 
Evaluation scheme with 1 items given 
Method: 'split' with 1 run(s). 
Training set proportion: 0.900 
Good ratings: >=3.000000 
Data set: 1700 x 867 rating matrix of class 'realRatingMatrix' with 1729 ratings. 

algorithms <- list( 
  "random items"  = list(name = "RANDOM", param = NULL), 
  "popular items" = list(name = "POPULAR", param = NULL), 
  "user-based CF" = list(name = "UBCF", param = list(nn = 50)), 
  "item-based CF" = list(name = "IBCF", param = list(k = 50)) 
) 

results <- evaluate(scheme, algorithms, type = "topNList", 
                    n = c(1, 3, 5, 10, 15, 20)) 
RANDOM run fold/sample [model time/prediction time] 
  1 [0sec/0.91sec] 
POPULAR run fold/sample [model time/prediction time] 
  1 [0.03sec/2.14sec] 
UBCF run fold/sample [model time/prediction time] 
  1 [0sec/71.13sec] 
IBCF run fold/sample [model time/prediction time] 
  1 [292.13sec/0.46sec] 

plot(results, annotate = c(1, 3), legend = "bottomright") 
Figure 6-55. True positive ratio versus false positive ratio 


The ROC curve shows that the accuracy of these recommendations is poor 
on this data. This could be because of the sparsity and the high bias in the ratings. 

6.10.4 Conclusion 


Association rule mining could also be thought of as a frequency-based version of probabilistic 
models like Naive Bayes. Although we don't refer to the terms directly as probabilities, it has 
the same roots. These algorithms are particularly well known for their ability to work with 
transaction data from supermarkets or e-commerce platforms, where customers usually buy 
more than one product in a single transaction. 

Finding some interesting patterns in such transactions could help reveal a whole 
new direction to the business or increase customer experience through product 
recommendation. In this section we started out with association rule mining algorithms 
like Apriori and went on to discuss the recommendation algorithms, which are 
closely related but take a different approach. Though much of the literature classifies 
the recommendation algorithm as a separate topic of its own, we have kept it under 
association rule mining, to show how the evolution happened. 

In the next section, we will discuss one of the most widely used models in the world of 
artificial intelligence, which has its roots in the biological neural network structures 
of human beings. 


6.11 Artificial Neural Networks 


We will start by building up our discussion of neural networks and then introduce deep 
learning, which is an extension of the neural network, toward the end. In recent times, deep 
learning has been getting quite a lot of attention from the research community and 
industry for its high accuracy. Neural network-based algorithms have become very 
popular in recent years and take center stage among machine learning algorithms. From a 
statistical learning point of view, they have become very popular for 
two main reasons: 


• We no longer need to make any assumptions about our data; 
any type of data works in neural networks (categorical and 
numerical). 


• They are scalable techniques, can take in billions of data points, 
and can capture a very high level of abstraction. 


It is important to mention here that neural networks are inspired by the way the 
human brain learns. Recent developments in this field have led to the training of far 
denser neural networks, making it possible to capture signals that other machine 
learning techniques can't. 


6.11.1 Human Cognitive Learning 


Artificial neural networks are inspired by biological neural networks. The human 
brain is one such large neural network, with neurons being the processing units of this big 
network. To understand how signals are processed in the brain, we need to understand the 
structure of the building block of the brain's neural network, the neuron. 
Figure 6-56. Neuron anatomy 


In Figure 6-56, you can see the anatomy of a neuron. The structure of a neuron and its 
function will help us build our artificial neural networks in computer systems. 

A neuron is the smallest unit of a neural network and can be excited by electrical signals. 
It can process and transmit information through electrical and chemical signals. The excited 
state of a neuron can be thought of as the 1 state in a transistor, and the unexcited state as 
the 0 state. The neuron takes input from the dendrites and transmits the signals generated 
(processed) in the cell body through axons. Each axon then connects to other neurons in the 
network. The next neuron then processes the information again and passes it to another neuron. 

Another important aspect is the process by which the transfer of signals takes 
place. Signals are transferred through synapses. There are chemical and 
electrical synapses: electrical synapses work very fast and transfer continuous 
signals, while chemical synapses work on an activation energy concept. The neuron will only 
transmit a signal if the strength of the signal is more than a threshold. These important 
features allow neurons to differentiate between signal and noise. 

A big network of such tiny neurons builds up in our nervous system, run by a dense 
neural network in our brain. Our brain learns and stores all the information in that 
densely packed neural network in our head. Scientists started investigating how our brain 
works and started experimenting with the learning process using an artificial equivalent 
of neurons. 

Now that you have a picture of the brain's architecture, let's try to understand how we 
humans learn something. I will take a simple example from golf: how a golfer learns 
the best force with which to hit a ball. 

Learning steps: 


1. You hit the ball with some force (seed value of force, F1). 

2. The ball falls short of the hole, say by 3 m (error is 3 m). 

3. You know the ball fell short, so in the next shot, you apply more 
force by delta (i.e., F2 = F1 + delta). 

4. The ball again falls short, by 50 cm (error is 50 cm). 

5. Again you found that the ball fell short, so you increase the 
force by delta (i.e., F3 = F2 + delta). 

6. Now you observe that the ball went beyond the hole by 
2 m (error is -2 m). 

7. Now you know that too much force was applied, so you 
change the rate of force increase (learning rate), say to delta2. 

8. Again you hit the ball with a new force, with delta2 as the 
improvement over the second shot (F4 = F2 + delta2). 

9. Now the ball falls very close to the hole, say within 25 cm. 

10. Now you know that delta2 worked for you, so you try 
again with the same delta2, and this time the ball goes into the hole. 

This process is simply based on learning from each outcome and updating yourself for 
better results. There can be many ways to learn how to correct and by what magnitude to 
improve. Researchers have successfully translated this biological idea of learning from a 
large number of events into artificial neural network learning, one of the most powerful 
tools the data scientist has. 
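The golf analogy can be written as a short loop. The following is only a sketch under made-up assumptions: the ball is assumed to travel 10 m per unit of force, the hole is assumed to be 100 m away, and the step (delta) is halved whenever the error changes sign. None of these numbers come from the example above, and the step-halving rule is a simple variant of the golfer's adjustment, not the only possible one.

```r
# Hypothetical setup: distance is proportional to force, hole is 100 m away
hole <- 100
distance <- function(force) 10 * force    # made-up "physics" for illustration

force <- 5          # seed value of force (F1)
delta <- 4          # initial change in force (plays the role of a learning rate)
history <- numeric(0)

for (shot in 1:20) {
  error <- hole - distance(force)         # positive: short, negative: beyond the hole
  history <- c(history, error)
  if (abs(error) < 0.5) break             # close enough: the ball is in the hole
  # If the error changed sign since the last shot, we overshot: use a smaller delta
  if (length(history) > 1 && sign(error) != sign(history[length(history) - 1])) {
    delta <- delta / 2
  }
  force <- force + sign(error) * delta    # more force if short, less if beyond
}

history   # errors observed after each shot
```

Each pass through the loop mirrors one shot: observe the error, decide the direction of the correction, and shrink the correction once the target has been overshot.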

Warren McCulloch and Walter Pitts, in their 1943 paper entitled “A Logical Calculus 
of the Ideas Immanent in Nervous Activity” [12], laid the foundation of a computational 
framework for neural networks. After their path-breaking work, the further development 
of neural networks split into biological processes and machine learning (applied neural 
networks). 

It will help to look back at the biological architecture and learning method while 
reading through the rest of this section on neural networks. 


6.11.2 Perceptron 


A perceptron is the basic unit of an artificial neural network; it takes multiple inputs and 
produces a binary output. In machine learning terminology, it is a supervised learning 
algorithm that classifies an input into class 0 or class 1. In simpler terms, it is a 
classification algorithm that makes its prediction from a linear predictor function 
combining a set of weights (parameters) with the feature vector. 

Figure 6-57. Working of a perceptron (mathematically) 

In machine learning, the perceptron is defined as a binary classifier function that 
maps its input x (a real-valued vector) to an output value f(x) (a single binary value): 

\[ f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} \] 

where w is a vector of real-valued weights, w \cdot x is the dot product 
\sum_{i=1}^{m} w_i x_i, m is the number of inputs to the perceptron, and b is the bias. 
The bias is independent of the input values and helps fix the decision boundary. 
The learning algorithm for a single perceptron can be stated as follows: 

1. Initialize the weights to some feasible values. 

2. For each data point j in the training set, do Steps 3 and 4. 

3. Calculate the output y_j(t) with the weights from the previous step. 
This is the output you get with the current weights in the perceptron. 

4. Update the weights: 

\[ w_i(t+1) = w_i(t) + \left(d_j - y_j(t)\right) x_{j,i}, \quad \text{for all features } 0 \le i \le n, \] 

where d_j is the desired output for data point j and x_{j,i} is the value of feature i 
for that data point. Did you observe any similarity with our golf example? 

5. In general, there can be three stopping criteria: 
• All the points in the training set are exhausted 
• A preset number of iterations is reached 
• The iteration error is less than a user-specified error threshold, \gamma 

The iteration error is defined as follows, where s is the number of points in the training set: 

\[ \frac{1}{s} \sum_{j=1}^{s} \left| d_j - y_j(t) \right| \] 
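The learning steps above can be sketched in a few lines of R. This is a minimal illustration, not a reference implementation: the function name train_perceptron, the logical OR training data, the zero starting weights, and the use of the mean absolute error as the iteration error are all assumptions chosen for the example. The bias is handled as weight w_0 by adding a constant input x_0 = 1.

```r
# Perceptron learning rule: w_i(t+1) = w_i(t) + (d_j - y_j(t)) * x_ji
train_perceptron <- function(X, d, max_iter = 100, threshold = 0) {
  X <- cbind(1, X)                  # x_0 = 1, so w_0 plays the role of the bias
  w <- rep(0, ncol(X))              # Step 1: initialize the weights
  for (iter in seq_len(max_iter)) {
    for (j in seq_len(nrow(X))) {   # Step 2: go through the training points
      y <- as.numeric(sum(w * X[j, ]) > 0)   # Step 3: output with current weights
      w <- w + (d[j] - y) * X[j, ]           # Step 4: update the weights
    }
    y_all <- as.numeric(X %*% w > 0)
    err <- mean(abs(d - y_all))     # iteration error over the training set
    if (err <= threshold) break     # Step 5: stop when the error is small enough
  }
  list(weights = w, iterations = iter, error = err)
}

# Made-up training data: the logical OR of two binary inputs (linearly separable)
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
d <- c(0, 1, 1, 1)

fit <- train_perceptron(X, d)
fit$weights   # bias first, then one weight per input
```

Because the OR data is linearly separable, the loop stops as soon as every training point is classified correctly; for data that is not separable it would run until max_iter.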


Now let's work through a simple example using a sample perceptron to make it clear how 
powerful a classifier it can be. 

A NAND gate is a Boolean operator that gives the value 0 if and only if all the 
operands have a value of 1, and otherwise gives a value of 1 (equivalent to NOT AND). 

Figure 6-58. NAND gate operator 


Now we will try to recreate this NAND gate with the perceptron in Figure 6-59 and 
see whether it reproduces the NAND outputs when we apply the weights logic. 





Figure 6-59. NAND gate perceptron 


There are two inputs to the perceptron, x1 and x2, each with possible values of 0 and 1. 
Hence, there are four possible input combinations, and we know the NAND output for each 
of them, as per Figure 6-58. 

The perceptron is a function of its weights and bias: if the weighted sum of the inputs 
plus the bias is greater than 0, the output is 1. For our example, we chose the weights as 
w1 = w2 = -2 and bias = 3. Now let’s compute the perceptron for each input and see the output. 


1. Input 00: (-2)0 + (-2)0 + 3 = 3 > 0, output is 1 
2. Input 01: (-2)0 + (-2)1 + 3 = 1 > 0, output is 1 
3. Input 10: (-2)1 + (-2)0 + 3 = 1 > 0, output is 1 
4. Input 11: (-2)1 + (-2)1 + 3 = -1 < 0, output is 0 


Our perceptron just implemented a NAND gate! 
Now that the basic concepts of neural networks are set, we can jump to increasing 
the complexity and bring more power to these basic concepts. 
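The NAND computation can be checked directly in R. The sketch below uses the weights (-2, -2) and bias 3 from the example; the helper name perceptron and the layout of the result table are choices made only for this illustration.

```r
# A perceptron outputs 1 when w . x + b > 0, and 0 otherwise
perceptron <- function(x, w, b) as.numeric(sum(w * x) + b > 0)

w <- c(-2, -2)   # weights from the NAND example
b <- 3           # bias from the NAND example

# All four input combinations of two binary inputs
inputs <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
inputs$output <- apply(inputs[, c("x1", "x2")], 1, perceptron, w = w, b = b)

# Reference NAND values computed with R's logical operators
inputs$nand <- as.numeric(!(inputs$x1 & inputs$x2))
inputs
```

The output column and the nand column agree on every row, which is the same conclusion as the hand calculation.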




6.11.3 Sigmoid Neuron 


Neural networks have a special kind of neuron, called the sigmoid neuron. It allows a 
continuous output, which a perceptron does not provide: the output of a sigmoid neuron 
is on a continuous scale. 

A sigmoid function is a mathematical function having an S-shaped curve (a sigmoid 
curve). The function is defined as follows: 

\[ S(t) = \frac{1}{1 + e^{-t}} \] 

Other examples/variations of the sigmoid function are the ogee curve, the Gompertz 
curve, and the logistic curve. Sigmoids are used as activation functions (recall the chemical 
synapse) in neural networks. 


Figure 6-60. Sigmoid function 


In neural networks, a sigmoid neuron has multiple inputs x1, x2, ..., xn, but its output 
is on a scale of 0 to 1. Similar to the perceptron, the sigmoid neuron has a weight for each 
input, i.e., w1, w2, ..., and an overall bias. To draw the similarity with the perceptron, observe 
that for a very large input (the dot product of inputs and weights plus the bias) the sigmoid 
neuron's output tends to 1, the same as the perceptron but asymptotically. Likewise, for a 
highly negative input the output tends to 0. 
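A short sketch makes the asymptotic behavior concrete. The function names sigmoid and sigmoid_neuron, and the sample weights and inputs, are assumptions made for illustration only.

```r
# Sigmoid function S(t) = 1 / (1 + exp(-t))
sigmoid <- function(t) 1 / (1 + exp(-t))

# A sigmoid neuron: weighted sum of the inputs plus bias, passed through the sigmoid
sigmoid_neuron <- function(x, w, b) sigmoid(sum(w * x) + b)

# Behavior for highly negative, zero, and very large input
sigmoid(c(-10, 0, 10))

# Same weights and bias as the NAND perceptron, but with a continuous output
sigmoid_neuron(c(1, 1), w = c(-2, -2), b = 3)
sigmoid_neuron(c(0, 0), w = c(-2, -2), b = 3)
```

Unlike the perceptron, which jumps from 0 to 1 at the threshold, the sigmoid neuron moves smoothly between the two values, so a small change in a weight produces a small change in the output.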

Now that we have covered the basic parts of the neural network, let’s discuss the 
architecture of neural networks in the next section. 


6.11.4 Neural Network Architecture 


A single perceptron can only do a good job of classification when the classes are linearly 
separable. The following example shows what we mean by linear separability. 

Figure 6-61. Linear separability 


In Figure 6-61(a) on the left, you can draw a line (a linear function of weights and bias) 
to separate + from -, but take a look at the image on the right. In Figure 6-61(b), the + and - 
are not linearly separable. We need to expand our neural network architecture to include 
more perceptrons to do non-linear separation. 

Similar to what happens in biological systems when they learn complicated things, we take 
the idea of a network of neurons and build a network of perceptrons as our artificial neural 
network. Figure 6-62 shows a simple expansion of perceptrons to a network of perceptrons. 






Figure 6-62. Artificial network architecture (input layer, hidden layers, output layer) 

This network is sometimes called a multi-layer perceptron (MLP). The leftmost layer 
is called the input layer, the rightmost layer is called the output layer, and the layer in 
between input and output is called the hidden layer. 
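To illustrate how a network of perceptrons handles a case that a single perceptron cannot, the sketch below builds the XOR function, which is not linearly separable, from three perceptrons arranged as an input layer, one hidden layer, and an output layer. The NAND weights are the ones used earlier; the OR and AND weights, and the function names step_fn and xor_net, are hypothetical choices made for this illustration and are set by hand rather than learned.

```r
# Step activation used by a perceptron
step_fn <- function(z) as.numeric(z > 0)

# Two hidden perceptrons feed one output perceptron
xor_net <- function(x1, x2) {
  x  <- c(x1, x2)
  h1 <- step_fn(sum(c(-2, -2) * x) + 3)      # hidden unit 1: NAND
  h2 <- step_fn(sum(c(2, 2) * x) - 1)        # hidden unit 2: OR (hand-picked weights)
  step_fn(sum(c(2, 2) * c(h1, h2)) - 3)      # output unit: AND of the hidden units
}

grid <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
grid$xor <- mapply(xor_net, grid$x1, grid$x2)
grid
```

Each hidden unit draws one straight line through the input space, and the output unit combines the two half-planes, which is what lets the network separate points that no single line can.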

The hidden layer is different from the input layer in that it does not receive any direct 
input. While the design of the input and output layers is determined by the number of 
inputs and outputs respectively, finding the design and number of hidden layers is not 
straightforward. Researchers have developed many design heuristics for the hidden 
layers; these heuristics help the network behave the way they want it to. In this 
section it's good to talk about two more features of neural nets: 


• Feed-forward neural networks (FFNN): As you can see from the 
simple architecture, if the input to each layer flows in one direction 
only, we call that network a feed-forward neural network. Such a 
network has no loops. There are many other types of neural 
networks, and deep learning in particular has expanded the list, 
but this is the most generic framework. 


• Specialization versus generalization: This is a general concept 
that relates to the complexity of the architecture (size and number of 
hidden layers). If the architecture is too complicated or has too many 
hidden layers, the neural network tends to be very specialized; in 
machine learning terms, it overfits. This is called a specialized 
neural network. At the other extreme, if you use too simple an 
architecture, the model will be very generalized and will not fit 
the data properly (it underfits). A data scientist has to keep this 
balance in mind while designing the neural net. 


Artificial neural networks have three main components to set up for the training 
exercise: 


• Architecture: Number of layers, weights matrix, bias, connections, 
etc. 


• Rules: The mechanism by which the neurons behave in 
response to signals from each other. 


• Learning rule: The way in which the neural network's weights 
change with time. 


In the next section, we will touch upon supervised and unsupervised learning, which 
relate to the concepts we have been learning in the book: problems with a dependent 
variable and problems with no dependent variable (like clustering). 


6.11.5 Supervised versus Unsupervised Neural Nets 


We present a quick recap of this subject with an illustration used by Andrew Ng in his 
Coursera course. It is a simple and intuitive example that differentiates between supervised 
and unsupervised learning (see Figure 6-63). 




Figure 6-63. Supervised versus unsupervised learning (left: supervised learning; right: unsupervised learning) 


The image on the left has the data labeled as two different types, so the algorithm 
knows that the objects are different, while on the right we have the same objects but 
have not told the algorithm which is which. 

So in very simple terms, learning is supervised when we provide the machine 
learning algorithm with the output for each input, and learning is unsupervised when 
we don't supply the output and the algorithm itself has to figure out the different 
sets of outputs. 

Examples: 


• Supervised learning: HousePrice is given in our data against 
each set of input variables. Our learning algorithm will try to learn, 
over multiple iterations, how to determine the house price based on 
the underlying features (e.g., linear regression). 


• Unsupervised learning: We just provide the house feature data 
without a target variable. In that case, the algorithm will group 
the houses based on similar sets of features (e.g., clustering). 
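The two cases can be contrasted on a tiny data set. Everything in the sketch below is made up for illustration: the six houses, their area and rooms features, and the HousePrice values. Linear regression (lm) stands in for the supervised case and k-means (kmeans) for the unsupervised case.

```r
set.seed(42)

# Made-up house data: two features and a target variable
houses <- data.frame(
  area  = c(50, 60, 70, 150, 160, 170),
  rooms = c(2, 2, 3, 5, 5, 6),
  HousePrice = c(62000, 71000, 88000, 170000, 178000, 195000)
)

# Supervised: the target HousePrice is supplied against each input
fit <- lm(HousePrice ~ area + rooms, data = houses)
new_house <- data.frame(area = 100, rooms = 4)
predicted_price <- predict(fit, newdata = new_house)

# Unsupervised: only the features are supplied; the algorithm groups similar houses
features <- scale(houses[, c("area", "rooms")])
clusters <- kmeans(features, centers = 2, nstart = 10)
clusters$cluster
```

In the supervised call the model is told what the right answer is for every row; in the unsupervised call it only sees the features and has to discover that the small and large houses form two groups.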


In the next section, we will introduce a supervised learning algorithm for neural 
networks and show an example of a neural net in R. R is not one of the preferred platforms 
for neural networks and deep learning, so we will limit ourselves to simple examples. 


6.11.6 Neural Network Learning Algorithms 


Learning algorithms determine how our machine learning process chooses a model 
for the underlying data. The general principle is to select the model that minimizes our 
cost function. A learning algorithm finds the best solution for a problem by controlling the 
training of the neural network. Most learning algorithms work on the principles of 
non-linear optimization and statistical estimation. 

Next, we touch upon the broader classes of learning algorithms for neural nets. 

6.11.6.1 Evolutionary Methods 


Evolutionary methods are derived from the evolutionary process in biology, where 
evolution happens through reproduction, mutation, selection, and recombination. 
A fitness function is used to measure the performance of a model, and based on this 
function we select our final model. 

The steps involved in this learning method are as follows: 


1. Create a population of solutions (i.e., weights on all the inputs). 


2. Apply the fitness function to see how this initial population 
performs. 


3. Select the best solution set from Step 2, and then breed it with 
other solutions (e.g., exchange the weight on one variable with that 
of another solution). 


4. Evaluate again on the fitness function, and continue with 
Steps 3 and 4 until you get a satisfactory solution. 


Genetic algorithms are inspired by this evolutionary process. 
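The four steps can be sketched as a very small genetic algorithm. This is an illustrative toy, not a production method: the target function y = 2*x1 - 3*x2, the population size of 20, the choice to keep the best 10 candidates, and the mutation noise are all assumptions made for this example.

```r
set.seed(1)

# Toy problem (made up): recover the weights of y = 2*x1 - 3*x2
X <- matrix(runif(40), ncol = 2)
y <- X %*% c(2, -3)
fitness <- function(w) -mean((y - X %*% w)^2)    # higher fitness is better

# Step 1: create a population of 20 candidate weight vectors
pop <- matrix(runif(20 * 2, min = -5, max = 5), ncol = 2)

# Step 2: apply the fitness function to the initial population
initial_best <- max(apply(pop, 1, fitness))

for (generation in 1:50) {
  scores <- apply(pop, 1, fitness)
  # Step 3: select the best half, then breed: each child takes weight 1 from one
  # parent and weight 2 from another, plus a small random mutation
  elite <- pop[order(scores, decreasing = TRUE)[1:10], , drop = FALSE]
  w1 <- elite[sample(10, 10, replace = TRUE), 1]
  w2 <- elite[sample(10, 10, replace = TRUE), 2]
  children <- matrix(c(w1, w2), ncol = 2) + matrix(rnorm(20, sd = 0.1), ncol = 2)
  # Step 4: the new population is evaluated again in the next pass
  pop <- rbind(elite, children)
}

scores <- apply(pop, 1, fitness)
best <- pop[which.max(scores), ]
final_best <- max(scores)
best
```

Because the best candidates are carried over unchanged into the next generation, the best fitness in the population can never get worse from one generation to the next.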


6.11.6.2 Gene Expression Programming 


Gene expression programming (GEP) is also a type of evolutionary learning algorithm. The 
learning method is inspired by how gene expression happens in biological organisms. 
Gene expression programs are implemented as complex tree structures that adapt 
by changing their size, shape, and composition. 

Though this is deemed to be an improvement over genetic algorithms, the general 
sentiment is that it has not been able to improve the learning results drastically. 


6.11.6.3 Simulated Annealing 


Simulated annealing is a very different approach from the evolutionary approach. This 
method uses a probabilistic technique to approximate the global optimum of the cost 
function. It searches for a solution in a large space by simulation. 

The steps involved in this method are: 


1. Start the iteration with some random value/solution weights. 


2. At each iteration, the algorithm uses probabilities to decide 
whether to stay in the same state or move to some neighbor 
state. 


3. If moved to the next state, check the value of the cost function. If 
it’s lower than the previous value, it was a successful move. 


4. Repeat Steps 2 and 3 until either you get the desired results or 
you want to stop the iterations. 

This method is computationally heavy. However, its probabilistic jumps are a good 
remedy for the problem of a model converging to a local optimum. 
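A minimal sketch of these steps follows. The cost function with two minima, the starting value, the cooling schedule, and the Metropolis-style acceptance rule (accept a worse state with probability exp(-increase / temperature)) are assumptions made for illustration; the text above does not prescribe a specific rule.

```r
set.seed(7)

# Made-up cost function with a local minimum near w = 2 and a lower one near w = -2
cost <- function(w) (w^2 - 4)^2 + w

w <- 3               # Step 1: start from some initial value
start_cost <- cost(w)
best <- w
temp <- 1            # "temperature" that controls how often worse moves are accepted

for (i in 1:2000) {
  candidate <- w + rnorm(1, sd = 0.5)          # a random neighbor state
  increase <- cost(candidate) - cost(w)
  # Step 2: always move if the cost drops; otherwise move with some probability
  if (increase < 0 || runif(1) < exp(-increase / temp)) {
    w <- candidate
  }
  # Step 3: keep track of the best state seen so far
  if (cost(w) < cost(best)) best <- w
  temp <- temp * 0.995                         # cool down gradually
}                                              # Step 4: repeat

c(best = best, cost = cost(best))
```

Early on, when the temperature is high, the search accepts many uphill moves and can leave the basin it started in; as the temperature drops it behaves more and more like a plain descent.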


6.11.6.4 Expectation Maximization 


Expectation maximization (EM) is a statistical learning method that uses an iterative 
procedure to find maximum likelihood or maximum a posteriori estimates. The algorithm 
typically proceeds in two steps: 


1. Generate the expectation function for the log-likelihood using 
the current estimate of the parameters (take some random 
seed value to start the iterations). 


2. Maximize the expectation function by tuning the parameters, 
and then use these parameters in the next iteration. 


These two steps, when done iteratively, cause the algorithm to converge to the 
parameters that maximize the log-likelihood function. 
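As a concrete illustration of the two steps, the sketch below runs EM on a mixture of two normal distributions. The simulated data (means 0 and 5), the starting values, and the fixed 50 iterations are assumptions made for this example; a mixture model is only one of many settings in which EM is used.

```r
set.seed(10)

# Simulated data from two normal distributions (made up for illustration)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))

# Starting estimates for the component means, standard deviations, and proportions
mu <- c(min(x), max(x))
sigma <- c(1, 1)
p <- c(0.5, 0.5)
loglik <- numeric(0)

for (iter in 1:50) {
  # Expectation step: probability that each point belongs to component 1 or 2
  d1 <- p[1] * dnorm(x, mu[1], sigma[1])
  d2 <- p[2] * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1
  loglik <- c(loglik, sum(log(d1 + d2)))   # log-likelihood at the current estimates

  # Maximization step: re-estimate the parameters using those probabilities
  p <- c(mean(r1), mean(r2))
  mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
}

round(mu, 2)   # estimated component means
```

The vector loglik records the log-likelihood at every iteration; plotting it shows the characteristic EM behavior of a curve that rises and then flattens as the estimates settle.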


6.11.6.5 Non-Parametric Methods 


Non-parametric methods take the opposite approach to parametric methods such as 
expectation maximization. In non-parametric methods, we don’t make any assumptions 
about the underlying data distribution. This allows a complex representation of the 
function, as no constraints come from the distribution. 

In neural networks, the model is represented by an unknown function that is a weighted 
sum of several sigmoids, each of which is a function of the explanatory variables. The 
algorithm then does a non-linear least squares optimization to get the final weights of the 
underlying objective function. 


6.11.6.6 Particle Swarm Optimization 


The particle swarm optimization algorithm was developed by observing how a flock of 
birds or a school of fish finds the best shape to move with the least resistance and the 
highest velocity. In this algorithm, we have the notions of position and velocity for 
particles. The particles are a population of candidate solutions. 

The algorithm tries to search for the solution in a large space, and each particle's 
movement is controlled by mathematical formulas for velocity (how fast the flock can 
move) and position (how the position of the particle in the flock influences the velocity). 
Though it’s a very powerful learning methodology, it does not guarantee a globally 
optimal solution. 
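The position and velocity idea can be sketched as follows. The objective function, the swarm size of 15, and the coefficients 0.7 and 1.5 in the velocity update are assumptions made for this illustration; the update used here is a common textbook form in which each particle is pulled toward its own best position and toward the best position found by the whole swarm.

```r
set.seed(3)

# Made-up objective with its minimum at (1, -2)
cost <- function(pt) sum((pt - c(1, -2))^2)

n <- 15
pos <- matrix(runif(n * 2, min = -5, max = 5), ncol = 2)   # particle positions
vel <- matrix(0, nrow = n, ncol = 2)                        # particle velocities

pbest <- pos                              # best position seen by each particle
pbest_cost <- apply(pos, 1, cost)
gbest <- pbest[which.min(pbest_cost), ]   # best position seen by the whole swarm
initial_cost <- min(pbest_cost)

for (iter in 1:100) {
  r1 <- matrix(runif(n * 2), ncol = 2)
  r2 <- matrix(runif(n * 2), ncol = 2)
  gmat <- matrix(gbest, nrow = n, ncol = 2, byrow = TRUE)
  # Velocity: keep some momentum, move toward own best and toward the swarm's best
  vel <- 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gmat - pos)
  pos <- pos + vel
  current <- apply(pos, 1, cost)
  improved <- current < pbest_cost
  pbest[improved, ] <- pos[improved, ]
  pbest_cost[improved] <- current[improved]
  gbest <- pbest[which.min(pbest_cost), ]
}

gbest
```

The swarm's best position is only ever replaced by a better one, but nothing in the procedure proves that the final position is the global optimum, which is the limitation noted above.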


6.11.7 Feed-Forward Back-Propagation 


Back-propagation learning is one of the most popular learning methodologies in neural 
networks. It is also called the “back-propagation of errors” method. In conjunction 
with an optimization method such as gradient descent, it can be used to 
train artificial neural networks. 

As the name suggests, errors are propagated backward through the network, so this is a 
supervised learning method: the error can only be computed when the desired output is 
known. Recall the golf example. Though the method can also be used in unsupervised 
settings, it remains best known as the standard method for training feed-forward neural 
networks. 

Another important point is that the method generally works on the gradient descent 
principle, so the neuron function (activation function) must be differentiable. 
Otherwise the gradient cannot be calculated and the method fails. 

Figure 6-64. Workings of the back-propagation method 


The algorithm can be executed using the following steps. We will give a 
mathematical representation of the error correction when the sigmoid function is used as 
the activation function: 


1. Feed the input forward through the network and get the output. 


2. Propagate the output error backward to calculate the delta (error) 
at each neuron. 


3. Multiply the delta by the input activation to get the 
gradient of the weight. 


4. Update the weight by subtracting a ratio (the learning rate) of the 
gradient from the weight. 


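In each iteration the algorithm corrects for the error, and it converges to a point 
where the error cannot be reduced any further. 

Mathematically, for each neuron j, its output o_j is defined as 

\[ o_j = \varphi(\mathrm{net}_j) = \varphi\left( \sum_{k=1}^{n} w_{kj}\, o_k \right) \] 

To update the weight w_{ij} using gradient descent, you must choose a learning rate, 
\alpha. The change in weight, which is added to the old weight, is equal to the product of the 
learning rate and the gradient, multiplied by -1: 

\[ \Delta w_{ij} = -\alpha \frac{\partial E}{\partial w_{ij}} = \begin{cases} -\alpha\, o_i\, (o_j - t_j)\, o_j (1 - o_j) & \text{if } j \text{ is an output neuron,} \\ -\alpha\, o_i \left( \sum_{l \in L} \delta_l w_{jl} \right) o_j (1 - o_j) & \text{if } j \text{ is an inner neuron,} \end{cases} \] 

where t_j is the target output of neuron j, L is the set of neurons that receive input from 
neuron j, and \delta_l is the delta already computed for neuron l. 

The -1 is required in order to update in the direction of a minimum, not a maximum, 
of the error function. 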
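The update rule can be tried out on a very small network. The sketch below is an illustration under stated assumptions: a 2-3-1 network with sigmoid units, the XOR truth table as training data, a squared-error function E = 0.5 * sum((o - t)^2), a learning rate of 0.5, and 5000 passes over the data. The variable names (W1, b1, W2, b2, delta_out, delta_hid) are chosen for this example only.

```r
set.seed(123)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Training data: XOR truth table (inputs X, targets)
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
target <- c(0, 1, 1, 0)
alpha <- 0.5                                           # learning rate

W1 <- matrix(rnorm(2 * 3), nrow = 2); b1 <- rnorm(3)   # input -> hidden weights
W2 <- rnorm(3);                       b2 <- rnorm(1)   # hidden -> output weights

# Step 1: feed the input forward and get hidden and output activations
forward <- function(W1, b1, W2, b2) {
  H <- sigmoid(sweep(X %*% W1, 2, b1, "+"))
  o <- sigmoid(as.vector(H %*% W2) + b2)
  list(H = H, o = o)
}
loss <- function(o) 0.5 * sum((o - target)^2)

initial_loss <- loss(forward(W1, b1, W2, b2)$o)

for (epoch in 1:5000) {
  f <- forward(W1, b1, W2, b2)
  # Step 2: deltas, output neuron first, then propagated back to the hidden neurons
  delta_out <- (f$o - target) * f$o * (1 - f$o)
  delta_hid <- (delta_out %o% W2) * f$H * (1 - f$H)
  # Steps 3 and 4: gradient = delta times input activation; subtract alpha * gradient
  W2 <- W2 - alpha * as.vector(t(f$H) %*% delta_out)
  b2 <- b2 - alpha * sum(delta_out)
  W1 <- W1 - alpha * (t(X) %*% delta_hid)
  b1 <- b1 - alpha * colSums(delta_hid)
}

final_loss <- loss(forward(W1, b1, W2, b2)$o)
c(initial = initial_loss, final = final_loss)
```

The line computing delta_out corresponds to the output-neuron case of the formula, and delta_hid to the inner-neuron case, where the sum over L reduces to a single term because there is only one output neuron.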


6.11.7.1 Purchase Prediction: Neural Network-Based 
Classification 


Let’s run our purchase prediction data with the nnet package in R and see how neural 
networks perform compared to our logistic regression example discussed in the 
regression section. 


#Load the data and prepare a dataset (same preparation as for logistic regression) 
Data_Purchase_Prediction <- read.csv("Dataset/Purchase Prediction Dataset.csv", header = TRUE) 

#Code ProductChoice 1 as 1 and ProductChoice 3 as 0; flag everything else as 999 
Data_Purchase_Prediction$choice <- ifelse(Data_Purchase_Prediction$ProductChoice == 1, 1, 
    ifelse(Data_Purchase_Prediction$ProductChoice == 3, 0, 999)) 

Data_Neural_Net <- Data_Purchase_Prediction[Data_Purchase_Prediction$choice %in% c("0", "1"), ] 


#Remove Missing Values 
Data_Neural_Net <- na.omit(Data_Neural_Net) 
rownames(Data_Neural_Net) <- NULL 

Scaling the continuous variables to the interval [0,1] or [-1,1] usually tends to give 
better results. The categorical variables are converted into binary variables. 
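Both transformations can be seen on a tiny, made-up data frame before applying them to the purchase data. The column values and the helper name min_max are assumptions for this illustration; only the column names CustomerAge and ModeOfPayment are borrowed from the data set.

```r
# Made-up data: one continuous and one categorical variable
toy <- data.frame(
  CustomerAge   = c(20, 35, 50),
  ModeOfPayment = factor(c("Cash", "CreditCard", "Cash"))
)

# Min-max scaling maps the smallest value to 0 and the largest to 1
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
toy$CustomerAge_scaled <- min_max(toy$CustomerAge)

# One binary (0/1) flag column per level of the categorical variable
flags <- sapply(levels(toy$ModeOfPayment),
                function(l) as.numeric(toy$ModeOfPayment == l))

toy
flags
```

Every row of flags contains exactly one 1, in the column of that row's payment mode, which is the same layout that the modeling data takes after the transformation.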


#Transforming the continuous variables 
cont <- Data_Neural_Net[, c("PurchaseTenure", "CustomerAge", "MembershipPoints", "IncomeClass")] 

maxs <- apply(cont, 2, max) 
mins <- apply(cont, 2, min) 

#Min-max scaling: subtract the minimum and divide by the range 
scaled_cont <- as.data.frame(scale(cont, center = mins, scale = maxs - mins)) 


#The dependent variable 
dep <- factor(Data_Neural_Net$choice) 


#One binary flag column per level of ModeOfPayment 
Data_Neural_Net$ModeOfPayment <- factor(Data_Neural_Net$ModeOfPayment) 

flags_ModeOfPayment = data.frame(Reduce(cbind, 
    lapply(levels(Data_Neural_Net$ModeOfPayment), 
           function(x){(Data_Neural_Net$ModeOfPayment == x)*1}))) 

names(flags_ModeOfPayment) = levels(Data_Neural_Net$ModeOfPayment) 


#One binary flag column per level of CustomerPropensity 
Data_Neural_Net$CustomerPropensity <- factor(Data_Neural_Net$CustomerPropensity) 

flags_CustomerPropensity = data.frame(Reduce(cbind, 
    lapply(levels(Data_Neural_Net$CustomerPropensity), 
           function(x){(Data_Neural_Net$CustomerPropensity == x)*1}))) 

names(flags_CustomerPropensity) = levels(Data_Neural_Net$CustomerPropensity) 

cate <- cbind(flags_ModeOfPayment, flags_CustomerPropensity) 


#Combine all data into single modeling data 
Dataset <- cbind(dep, scaled_cont, cate) 


#Divide the data into train and test 
set.seed(917) 
index <- sample(1:nrow(Dataset), round(0.7 * nrow(Dataset))) 
train <- Dataset[index, ] 
test <- Dataset[-index, ] 

Now we will fit the network using the nnet package in R. Note that nnet optimizes the 
weights with the BFGS method of optim rather than with plain gradient-descent 
back-propagation. 


library(nnet) 

i <- names(train) 

form <- as.formula(paste("dep ~", paste(i[!i %in% "dep"], collapse = " + "))) 
nn <- nnet(form, size = 10, data = train) 


# weights: 181 
initial value 151866.965727 
iter 10 value 108709.305804 
iter 20 value 107666.702615 
iter 30 value 107382.819447 
iter 40 value 107267.937386 
iter 50 value 107203.589847 
iter 60 value 107138.952084 
iter 70 value 107084.361878 
iter 80 value 107037.998279 
iter 90 value 107003.328743 
iter 100 value 106970.152142 
final value 106970.152142 
Stopped after 100 iterations 

predict_class <- predict(nn, newdata = test, type = "class") 


#Classification table 
table(test$dep, predict_class) 

   predict_class 
        0     1 
  0 28776 13863 
  1 11964 19534 

#Classification rate 
sum(diag(table(test$dep, predict_class)) / nrow(test)) 

[1] 0.6516314 


In the previous architecture, we used 10 neurons in one hidden layer. The accuracy 
comes out to be 65%, which is 1% more than what we saw in logistic regression. The neural 
net has improved prediction on 0 while it has deteriorated on 1. (Do you want to try an 
ensemble? We will discuss this in Chapter 8.) 
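The classification rate can be recomputed by hand from the classification table. The sketch below enters the four counts from the table above into a matrix; the object names cm, accuracy, and per_class are chosen for this illustration.

```r
# Classification table from the output above (rows: actual, columns: predicted)
cm <- matrix(c(28776, 11964, 13863, 19534), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Overall classification rate: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Share of each actual class that was predicted correctly
per_class <- diag(cm) / rowSums(cm)

accuracy
per_class
```

Looking at per_class alongside the overall rate shows how the correct predictions are split between the 0 and 1 classes, which a single accuracy number hides.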

Look at the neural net with 10 hidden neurons; it is able to improve prediction for 0s. 
If you extend this training to deep learning, even weaker signals may be captured. In 
the deep learning section, we will run the same example with a multi-layer deep architecture. 


library(NeuralNetTools) 

Warning: replacing previous import by ‘scales::alpha’ when loading 
‘NeuralNetTools’ 

# Plot the neural network 

plotnet(nn) 

Figure 6-65. One hidden layer neural network 


#get the neural weights 
neuralweights(nn) 
$struct 
[1] 16 10  1 


$wts 

$wts$`hidden 1 1` 
 [1]  -1.7688041 -20.6924206   2.3683340   0.3254776   0.3755354 
 [6]  -0.4381737  -0.9342264  -0.4396708   0.2488121  -0.8040053 
[11]  -0.2513980   1.1595037  -0.5800809   0.9427963  -0.5210107 
[16]  -0.5680854   0.9942396 


$wts$`hidden 1 2` 
 [1]  0.3785581  2.7997630  0.0419642 -1.8159788 -2.0329127  0.2695198 
 [7]  0.3923006 -2.1276359  0.3242286  0.4522314  0.5254541  1.2197842 
[13] -0.1996586  2.2651791  0.4066352  3.6192206 -5.2330743 


$wts$`hidden 1 3` 
 [1]  -1.4357242 -12.9881898 -12.3360008   1.1062240   3.6054822 
 [6]   1.5317392   0.6969328  -6.2048082   0.9177840  -0.1734451 
[11]   0.1648537   2.1053240   0.6816542  -2.9358718  -1.0474676 
[16]  -0.4098642   1.5974077 


$wts$`hidden 1 4` 
 [1] -5.30486658  2.93556841 -9.97245085  0.30268208  6.59471280 
 [6]  1.95089306  0.69071825  0.31481250 -0.06330620 -1.00934374 
[11]  0.93998141 -9.14052075 -6.52385269 -1.32746226 -1.07514308 
[16]  0.06271666  3.52729817 


$wts$`hidden 1 5` 
 [1]   2.24357572  -7.90629807  -2.19299184   0.78657421 -13.42029541 
 [6]   1.35697587   0.76688140  -4.08706478   2.90349734  -0.59422438 
[11]   2.21698054  -0.08467332   1.68745126  -0.43716182  -0.34025868 
[16]  -2.29645501   2.73500554 

$wts$`hidden 1 6` 
[1] -3.7195678 1.5885211 
[7] -1.0924695 0.3577909 
[13] 0.1024139 -5.6953417 


-3.3623349 -1.6354780 
-0.6803754 1.4831636 
4.9241476 


. 8999143 
- 9332748 
- 9350386 


0.9809355 
-0.4331445 
0.8179687 


$wts$`hidden 1 7` 
[1] -0.8225491 -4.8242434 
[7] -0.6713392 1.0763017 
[13] 1.3856182 -1.1600616 


-2.9266563 2. 
-0.2546451 
1.2339496 


0.1378938 -0.3450762 
-0.5570266 -0.2484610 
-0.7762755 


5035607 
8533341 
-1.2949715 


$wts$`hidden 1 8` 
 [1] -3.86805085  2.35232847 -2.48545877 -0.14794972  0.07481260 
 [6]  0.70845847  0.38961887 -2.34134097 -2.32810205 -0.80392872 
[11] -0.08502893 -1.81432815  0.05929793 -0.19809056 -0.27217330 
[16]  0.47082670 -4.67137272 


$wts$`hidden 1 9` 
 [1]   0.80066147   2.72835254  -6.01889627 -10.63057306   7.63526853 
 [6]  -1.85188181  -0.59883189   0.86011432   2.28279639  -0.80140313 
[11]  -3.41439405   4.47209147   3.98812529   0.05217016   1.42120448 
[16]  -2.87977768  -1.80152670 


$wts$`hidden 1 10` 
 [1]  -1.41326881 -16.86494495  -0.25563167   0.02405375  -5.82554392 
 [6]   0.20502350   0.68081754  -4.30017547   0.24592770   0.94533019 
[11]   0.51276882  -0.10970560   1.52611041   1.41750276   2.40763017 
[16]  -1.56584208  -5.13504576 


$wts$`out 1` 
 [1] -2.0906131 -0.8660608  2.5900163 -0.9717815  1.1467203 -0.8147543 
 [7]  2.3220405  1.7924673 -3.5013152  0.2313364 -2.3259027 

# Plot the importance 
olden(nn) 

Figure 6-66. Attribute importance by Olden method 


#variable importance by garson algorithm 
garson(nn) 

Figure 6-67. Attribute importance by Garson method 


We showed how to create a neural network and examine its weights and predictions. R 
libraries for neural networks have been expanding very fast, so it is a good idea to stay 
up to date with the new tools being created by the research community. Artificial 
Intelligence: A Modern Approach by Stuart Russell and Peter Norvig is a great book for 
digging deeper into artificial neural networks. 


6.11.8 Deep Learning 


We take a jump here from the commonly used neural networks to very complex and large 
deep graphs, which can model high-level abstractions in data. Deep learning consists of 
advanced algorithms having multiple layers, composed of multiple linear and non-linear 
transformations. Some scholars place deep learning algorithms in the bucket of machine 
learning methods based on learning representations of data, e.g., images, handwriting, etc. 

There are multiple deep learning architectures used in the fields of computer vision, 
automatic speech recognition, NLP, audio recognition, and other complicated areas. These 
applications are bringing machine learning closer to artificial intelligence. Some of the 
well-known deep learning architectures include deep neural nets, convolutional deep neural 
networks, deep belief networks, recurrent neural networks, and others. A lot of the 
advancement in deep learning is coming from neuroscience, and researchers are bringing 
more advanced ways to represent data and create deep learning models to understand these 
representations. 

Rina Dechter introduced first-order deep learning and second-order deep learning 
in her work titled “Learning While Searching in Constraint-Satisfaction Problems” (1986, 
University of California, Computer Science Department, Cognitive Systems Laboratory). 
Recently the definition of deep learning algorithms has been expanded to include 
algorithms that generally follow these guidelines: 


• Use many layers of nonlinear processing units for feature 
extraction and transformation 


• Are based on the (unsupervised) learning of multiple levels of 
features or representations of the data 

• Belong to the field of learning representations of data 


• Form a hierarchy of concepts, learning from multiple levels of 
representation corresponding to higher levels of abstraction 


The architecture of deep neural networks is very complex. There can be multiple 
hidden layers and advanced methods of learning. In previous discussions of neural 
networks, we focused on basic types of networks with only one hidden layer. In deep 
neural networks, the number of layers grows many-fold and the network can process data 
at a very high level of abstraction. 

Figure 6-68. A multi-layer deep neural network 


In Figure 6-68, you can see the network has become very complicated and has 
multiple hidden layers. In general, adding more layers and more neurons per layer increases 
the specialization of the neural network to the training data and decreases its performance 
on test data. This points out two issues with deep neural networks: 


• Specialization (overfitting): Too many layers of abstraction make 
the model learn the training data as if little or no variation from 
it could ever occur. In these cases, the model does not 
return good results on testing data. 


• Computational cost: Adding layers and neurons costs a lot in 
computational resources, both time and memory. Because of this, 
deep neural networks are developed on clusters and large servers. 


There are many popular neural network architectures used in different 
applications; here are some of them: 


• Convolutional neural networks: Used for images and other 
two-dimensional data 


• Recurrent neural networks: Used for time series data, as they can 
retain history (memory compressor) 

• Recursive neural networks: Used in natural language processing 


• Deep belief networks: Probabilistic and generative models, used 
for image and signal processing 


There are a lot of other architectures and learning algorithms. We will show a basic 
example of a simple multi-layer neural network using the darch package in R and will also 
show an example of image classification with the mxNet package. 

Deep learning has been the focus of many researchers and machine learning 
professionals; however, R does not yet have enough well-developed tools to run the various 
deep learning algorithms. Another reason is that deep learning is so resource intensive 
that large models are usually trained on clusters rather than on workstations. 

There are a few packages in R that can do deep learning (by the way, a neural 
net with multiple hidden layers is also a deep learning framework): 


• H2O implements feed-forward neural nets and autoencoders 


• DeepNet implements deep neural networks, deep belief 
networks, and restricted Boltzmann machines 


• mxNet implements complex deep nets for image classification 
using convolutional networks 


• darch implements deep neural nets and restricted Boltzmann 
machines 


Here, we discuss two examples of deep learning using R. 
1. darch for classification 


Our first example applies a deep architecture to the logistic model we discussed in 
our regression section. The dependent variable is choice and the independent 
variables are PurchaseTenure, CustomerAge, MembershipPoints, IncomeClass, 
ModeOfPayment, and CustomerPropensity. 

All the continuous variables are scaled and the categorical variables are converted into 
binary variables. For example, see the following pre- and post-transformation data 
matrices. 


#Pre-transformation
head(Data_Purchase_Prediction[, c("choice", "PurchaseTenure", "CustomerAge",
    "MembershipPoints", "IncomeClass", "ModeOfPayment", "CustomerPropensity")])

  choice PurchaseTenure CustomerAge MembershipPoints IncomeClass
1    999              4          55                6           4
2      0              4          75                2           7
3    999             10          34                4           5
4      0              6          26                2           4
5    999              3          38                6           7
6      0              3          71                6           4
  ModeOfPayment CustomerPropensity
1   MoneyWallet             Medium
2    CreditCard           VeryHigh
3   MoneyWallet            Unknown
4   MoneyWallet                Low
5   MoneyWallet           VeryHigh
6     DebitCard               High

#Post-transformation
head(train)

       dep PurchaseTenure CustomerAge MembershipPoints IncomeClass
210877   1     0.01176471   0.7017544       0.08333333       0.625
233397   0     0.02352941   0.1578947       0.25000000       0.500
53282    0     0.08235294   0.7192982       0.16666667       0.750
176631   0     0.22352941   0.6315789       0.41666667       0.500
219592   0     0.02352941   0.2807018       0.16666667       0.500
40929    1     0.08235294   0.1929825       0.58333333       0.625
       BankTransfer Cash CashPoints CreditCard DebitCard MoneyWallet
210877            0    0          0          0         1           0
233397            0    0          0          0         1           0
53282             0    0          0          1         0           0
176631            1    0          0          0         0           0
219592            0    0          0          0         0           1
40929             0    0          0          0         0           1
       Voucher High Low Medium Unknown VeryHigh
210877       0    0   0      0       1        0
233397       0    0   0      1       0        0
53282        0    1   0      0       0        0
176631       0    0   0      0       0        1
219592       0    0   1      0       0        0
40929        0    0   0      0       0        1
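
The transformation shown in these two tables can be sketched in base R. This is only an illustration under assumptions: the toy values are hypothetical, and min-max scaling is assumed because the post-transformation values lie between 0 and 1; the exact preprocessing used to build train may differ.

```r
# Toy data frame with hypothetical values shaped like the purchase data
toy <- data.frame(
  CustomerAge   = c(55, 75, 34, 26),
  ModeOfPayment = factor(c("MoneyWallet", "CreditCard", "MoneyWallet", "DebitCard"))
)

# Min-max scaling maps a numeric vector onto [0, 1] (assumed scaling method)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# One indicator (0/1) column per factor level; "- 1" drops the intercept
dummies <- model.matrix(~ ModeOfPayment - 1, data = toy)
colnames(dummies) <- sub("^ModeOfPayment", "", colnames(dummies))

encoded <- data.frame(CustomerAge = min_max(toy$CustomerAge), dummies)
print(encoded)
```

Each row ends up with exactly one 1 among the payment-mode columns, which is the same pattern visible in the post-transformation matrix.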


#We will use the same data as in the previous neural network example
devtools::install_github("maddin79/darch")

library(darch)
library(mlbench)
library(RANN)


#Print the model formula
form


#Apply the model using a deep neural net
# deep_net <- darch(form, train,
#     preProc.params = list("method" = c("knnImpute")),
#     layers = c(0,10,30,10,0),
#     darch.batchSize = 1,
#     darch.returnBestModel.validationErrorFactor = 1,
#     darch.fineTuneFunction = "rpropagation",
#     darch.unitFunction = c("tanhUnit", "tanhUnit", "tanhUnit", "softmaxUnit"),
#     darch.numEpochs = 15,
#     bootstrap = T,
#     bootstrap.num = 500)


deep_net <- darch(form, train,
    preProc.params = list(method = c("center", "scale")),
    layers = c(0,10,30,10,0),
    darch.unitFunction = c("sigmoidUnit", "tanhUnit", "tanhUnit", "softmaxUnit"),
    darch.fineTuneFunction = "minimizeClassifier",
    darch.numEpochs = 15,
    cg.length = 3, cg.switchLayers = 5)


#Plot the deep net
library(NeuralNetTools)
plot(deep_net, "net")


#Evaluate the fitted network on the test data
result <- darchTest(deep_net, newdata = test)
result


A good reference for darch is available on CRAN, and a short article can be read
at http://static.saviola.de/publications/rueckert_2016.pdf.


2. mxNet image classification 


We will show a popular example that uses an already trained image classification
model. The mxNet package comes with a pre-trained Inception-BatchNorm network model,
which can predict the class of a real-world image.

The pre-trained model is provided separately for you. Also, mxNet is not
available on CRAN, so install it using the following commands. The following example
has been recreated from the Git repositories of the mxnet project at
https://github.com/dmlc/mxnet/tree/master/R-package and of imager at
https://github.com/dahtah/imager.


install.packages("drat", repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("mxnet")


#Please refer to https://github.com/dahtah/imager
install.packages("devtools")
devtools::install_github("dahtah/imager")
library(mxnet)


#Load imager for loading images
library(imager)


#Load the pre-trained model
model <- mx.model.load("Inception/Inception_BN", iteration=39)

#We also need to load in the mean image, which is used for preprocessing,
#using mx.nd.load.
mean.img = as.array(mx.nd.load("Inception/mean_224.nd")[["mean_img"]])

#Load and plot the image (the default parrot image is commented out)
#im <- load.image(system.file("extdata/parrots.png", package="imager"))
im <- load.image("Images/russia-volcano.jpg")
plot(im)

Figure 6-69. A sample volcano picture for the image recognition exercise


Next, we preprocess the image so that it can be passed to the model.


preproc.image <- function(im, mean.image) {
  # crop the image
  shape <- dim(im)
  short.edge <- min(shape[1:2])
  xx <- floor((shape[1] - short.edge) / 2)
  yy <- floor((shape[2] - short.edge) / 2)
  cropped <- crop.borders(im, xx, yy)
  # resize to 224 x 224, needed by input of the model.
  resized <- resize(cropped, 224, 224)
  # convert to array (x, y, channel)
  arr <- as.array(resized) * 255
  dim(arr) <- c(224, 224, 3)
  # subtract the mean (use the function argument rather than the global mean.img)
  normed <- arr - mean.image
  # Reshape to format needed by mxnet (width, height, channel, num)
  dim(normed) <- c(224, 224, 3, 1)
  return(normed)
}


#Now pass our image to pre-process
normed <- preproc.image(im, mean.img)
plot(normed)

Figure 6-70. Normalized image
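
The cropping and mean-subtraction arithmetic inside preproc.image can be traced on a plain array without imager or mxnet. The following is a minimal sketch under assumptions: the random 300 x 200 image, the helper name center_crop, and the constant channel mean of 117 are all hypothetical, and the resize to 224 x 224 is omitted because it needs an interpolation routine.

```r
# Hypothetical 300 x 200 RGB image filled with random intensities in [0, 1]
set.seed(1)
img <- array(runif(300 * 200 * 3), dim = c(300, 200, 3))

center_crop <- function(a) {
  shape <- dim(a)
  short_edge <- min(shape[1:2])
  # Offsets that leave an equal margin on both sides of the longer edge
  x0 <- floor((shape[1] - short_edge) / 2)
  y0 <- floor((shape[2] - short_edge) / 2)
  a[(x0 + 1):(x0 + short_edge), (y0 + 1):(y0 + short_edge), , drop = FALSE]
}

cropped <- center_crop(img)

# Rescale to 0-255 and subtract a per-channel mean (illustrative constants)
channel_mean <- c(117, 117, 117)
normed_toy <- sweep(cropped * 255, 3, channel_mean)

dim(normed_toy)
```

The longer edge is trimmed symmetrically so that the square crop stays centered, which is the same idea as the xx and yy offsets in the function above.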


The next step is to classify the image.

prob <- predict(model, X=normed)

#We can extract the top-5 class index.
max.idx <- order(prob[,1], decreasing = TRUE)[1:5]
max.idx

[1] "981" "980" "971" "673" "985"

synsets <- readLines("Inception/synset.txt")

#And let us print the corresponding lines:
print(paste0("Predicted Top-classes: ", synsets[as.numeric(max.idx)]))

[1] "Predicted Top-classes: n09472597 volcano"
[2] "Predicted Top-classes: n09468604 valley, vale"
[3] "Predicted Top-classes: n09193705 alp"
[4] "Predicted Top-classes: n03792972 mountain tent"
[5] "Predicted Top-classes: n11879895 rapeseed"
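
The top-k lookup used here is ordinary base R and can be tried without a trained model. In this small sketch the probability values, the label vector, and the helper name top_k are made up for illustration.

```r
# Hypothetical class probabilities for one image (rows = classes, columns = images)
prob_toy <- matrix(c(0.02, 0.60, 0.05, 0.25, 0.08), ncol = 1)
labels_toy <- c("valley", "volcano", "tent", "alp", "rapeseed")

# Indices of the k largest probabilities, highest first
top_k <- function(p, k) order(p, decreasing = TRUE)[seq_len(k)]

idx <- top_k(prob_toy[, 1], 3)
data.frame(class = labels_toy[idx], probability = prob_toy[idx, 1])
```

order() returns positions rather than values, so the same index vector can be used to look up both the probabilities and the class labels.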




You can see that the deep learning algorithm has detected the volcano in the image.
You can repeat this experiment with other images and see what classifications you get.
Also note that you need to keep up to date with the latest version of the package on Git.


6.11.9 Conclusion 


Neural networks are very powerful tools that can learn from a dataset with few
assumptions about the input data. Further, new research into their architecture and
learning methods has given rise to deep neural networks. This has opened up deep
learning to various fields, specifically those with high-volume, highly abstract
data. Deep neural nets are making possible computer vision, speech recognition, gene
matching, and solutions to other complex problems.

In the next section, we will delve into the world of unstructured data. You will see
how some simple techniques can transform completely unstructured textual
data into a matrix of numerical observations, which can then be used with many other
algorithms for classification, clustering, and so on.


6.12 Text-Mining Approaches 


In recent years, the volume of text data has increased manifold, particularly
digitally generated or digitally stored text. A big part of the big data world is
text data that is generated and stored in large volumes. Another important aspect of
text data is that it can now be generated by anybody and can have implications for
business. For example, a bad product review can damage the market image of the
product, or a social media post about a social cause can create a campaign. In all
these cases, text data plays a pivotal role in influencing behavior. In the 21st
century, it has become important for organizations to invest in text data and
understand what insights it offers into consumer behavior or product performance.
Brandwatch (https://www.brandwatch.com/2016/05/44-twitter-stats-2016/)
published data on Twitter statistics; let's look at some of it.


• Twitter has 310 million active users (each user is a source of text data)


• 83% of world leaders are on Twitter (leaders' tweets are text data
that influence markets, people, policies, and so on)


• 500 million tweets are sent daily (isn't this big data?)


• 65.8% of U.S. companies with 100+ employees use Twitter for
marketing (how can we use this data to manage and outshine in
the marketing programs?)


• 80% of Twitter users have mentioned a brand in a tweet (doesn't
this compel us to look at the treasure of information hidden in
text data?)


These statistics tell us that text data is important to analyze in today's world.
Because text data is massive, we need advanced machine learning methods and enhanced
natural language processing to harness its power. Some statistics suggest 80% of the
information we store today is in text format, signifying the commercial value of text
mining.

Formally, text analysis involves information retrieval, lexical analysis to study
word frequency distributions, tagging/annotation, information extraction, data mining
techniques including link and association analysis, visualization, pattern recognition,
and predictive analytics. The end goal is to take unstructured text and convert it into
data for analysis by using powerful techniques of Natural Language Processing and other
mathematical methods (e.g., frequency plots, Singular Value Decomposition, etc.).

In this section, we will introduce the basics of text analytics using R. Toward the
end of the chapter, we will show an example of how to use a Microsoft API to unlock
powerful text mining tools that are currently not available in R.


6.12.1 Introduction to Text Mining 


The explosion in the amount of unstructured data has led to numerous use cases for
text mining. The ability to process textual data quickly and convert it into a numeric
feature matrix has opened up a plethora of machine learning algorithms to be used on
such data. The field of Natural Language Processing (NLP), though vast, could be
thought of as a subfield of ML. In an alternative view, text mining approaches help in
turning text into data for analysis, via the application of NLP and analytical methods.

In the following sections, we will go a little deeper into text mining concepts like
text categorization, summarization, TF-IDF, Part of Speech (POS) tagging, and simple
visualization using WordCloud.

We will use the Amazon Fine Food Reviews dataset for a couple of text mining
approaches.

Let's start by looking at the data briefly and then choose a smaller subset for all the
demonstrations.


a. Data Summary 


library(data.table)
fine_food_data <- read.csv("Dataset/Food Reviews.csv",
    stringsAsFactors = FALSE)

fine_food_data$Score <- as.factor(fine_food_data$Score)

str(fine_food_data[-10])

'data.frame': 35173 obs. of 9 variables:
 $ Id                    : int 1 2 3 4 5 6 7 8 9 10 ...
 $ ProductId             : chr "B001E4KFG0" "B00813GRG4" "B000LQOCH0" "B000UA0QIQ" ...
 $ UserId                : chr "A3SGXH7AUHU8GW" "A1D87F6ZCVE5NK" "ABXLMWJIXXAIN" "A395BORC6FGVXV" ...
 $ ProfileName           : chr "delmartian" "dll pa" "Natalia Corres \"Natalia Corres\"" "Karl" ...
 $ HelpfulnessNumerator  : int 1 0 1 3 0 0 0 0 1 0 ...
 $ HelpfulnessDenominator: int 1 0 1 3 0 0 0 0 1 0 ...
 $ Score                 : Factor w/ 5 levels "1","2","3","4",..: 5 1 4 2 5 4 5 5 5 5 ...
 $ Time                  : int 1303862400 1346976000 1219017600 1307923200 1350777600 1342051200 1340150400 1336003200 1322006400 1351209600 ...
 $ Summary               : chr "Good Quality Dog Food" "Not as Advertised" "\"Delight\" says it all" "Cough Medicine" ...

# Last column - Customer review in free text
head(fine_food_data[, 10], 2)

[1] "I have bought several of the Vitality canned dog food products and
have found them all to be of good quality. The product looks more like a
stew than processed meat and it smells better. My Labrador is finicky and
she appreciates this product better than most."

[2] "Product arrived labeled as Jumbo Salted Peanuts...the peanuts were
actually small sized unsalted. Not sure if this was an error or if the
vendor intended to represent the product as \"Jumbo\"."


b. Data Preparation 


library(caTools)


# Randomly split data and use only 10% of the dataset
set.seed(90)
split = sample.split(fine_food_data$Score, SplitRatio = 0.10)


fine_food_data = subset(fine_food_data, split == TRUE)
select_col <- c("Id", "HelpfulnessNumerator", "HelpfulnessDenominator", "Score",
                "Summary", "Text")


fine_food_data_selected <- fine_food_data[, select_col]


6.12.2 Text Summarization 


This applies the method of Gong & Liu (2001) for generic text summarization of text 
document D via latent semantic analysis: 


1. Decompose the document D into individual sentences and
use these sentences to form the candidate sentence set S and
set k = 1.


2. Construct the terms by sentences matrix A for the
document D.


3. Perform the SVD on A to obtain the singular value matrix,
and the right singular vector matrix V^t. In the singular vector
space, each sentence i is represented by the column vector.


4. Select the k'th right singular vector from matrix V^t.


5. Select the sentence that has the largest index value with the
k'th right singular vector and include it in the summary.


6. If k reaches the predefined number, terminate the operation;
otherwise, increment k by 1 and go back to Step 4.


(Cited directly from Gong & Liu, 2001, p. 21)[9] 
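
The six steps can be sketched in base R as follows. This is an illustrative approximation, not the implementation inside the LSAfun package: the function name lsa_summary, the regular-expression sentence splitter, the raw term counts, and the use of the absolute value (to sidestep the arbitrary sign of singular vectors) are all assumptions made for this example.

```r
# Sketch of LSA-based sentence selection using base R only
lsa_summary <- function(text, k = 1) {
  # Step 1: split the document into sentences after ., ! or ?
  sentences <- trimws(unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)))
  sentences <- sentences[nzchar(sentences)]

  # Step 2: build the terms-by-sentences count matrix A
  tokens <- lapply(sentences, function(s) {
    w <- unlist(strsplit(tolower(s), "[^a-z]+"))
    w[nzchar(w)]
  })
  vocab <- sort(unique(unlist(tokens)))
  counts <- lapply(tokens, function(w) as.numeric(table(factor(w, levels = vocab))))
  A <- matrix(unlist(counts), nrow = length(vocab))

  # Step 3: SVD of A; column i of sv$v is the i-th right singular vector
  sv <- svd(A)
  k <- min(k, ncol(sv$v))

  # Steps 4-6: for each of the first k vectors, keep the sentence with the
  # largest entry (absolute value, since the sign of a singular vector is arbitrary)
  picked <- vapply(seq_len(k), function(i) which.max(abs(sv$v[, i])), integer(1))
  sentences[unique(picked)]
}

review <- paste(
  "The oatmeal is quick to prepare.",
  "The oatmeal tastes good and the oatmeal is cheap.",
  "Shipping was slow."
)
lsa_summary(review, k = 2)
```

Because each right singular vector has one entry per sentence, picking the largest entry selects the sentence that best represents that latent topic.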

Let's see how well the summarization works on our Amazon fine food review
dataset. To assess the results, we will compare them qualitatively with the Summary
attribute in the dataset.


a. Original Text
fine_food_data_selected[2, 6]


[1] "McCann's Instant Oatmeal is great if you must have your 
oatmeal but can only scrape together two or three minutes to 
prepare it. There is no escaping the fact, however, that even 
the best instant oatmeal is nowhere near as good as even a 
store brand of oatmeal requiring stovetop preparation. Still, 
the McCann's is as good as it gets for instant oatmeal. It's even 
better than the organic, all-natural brands I have tried. All the 
varieties in the McCann's variety pack taste good. It can be 
prepared in the microwave or by adding boiling water so it is 
convenient in the extreme when time is an issue.<br /><br /> 
McCann's use of actual cane sugar instead of high fructose 
corn syrup helped me decide to buy this product. Real sugar 
tastes better and is not as harmful as the other stuff. One thing 
I do not like, though, is McCann's use of thickeners. Oats plus 
water plus heat should make a creamy, tasty oatmeal without 
the need for guar gum. But this is a convenience product. 
Maybe the guar gum is why, after sitting in the bowl a while, 
the instant McCann's becomes too thick and gluey." 


b. Summary generated by genericSummary


library(LSAfun)
genericSummary(fine_food_data_selected[2, 6], k = 1)


[1] " There is no escaping the fact, however, that even the best 
instant oatmeal is nowhere near as good as even a store brand 
of oatmeal requiring stovetop preparation." 


c. Multiple summaries generated by genericSummary


library(LSAfun)
genericSummary(fine_food_data_selected[2, 6], k = 2)


[1] " There is no escaping the fact, however, that even the best
instant oatmeal is nowhere near as good as even a store brand
of oatmeal requiring stovetop preparation."
[2] " It can be prepared in the microwave or by adding boiling
water so it is convenient in the extreme when time is an issue."


d. Summary from the dataset
fine_food_data_selected[2, 5]
[1] "Best of the Instant Oatmeals" 


Observe the striking similarity between the context of the text
and the summary generated by the function. Text summarization
has many wide-ranging applications. Google uses it to display
the most relevant piece of information from a given web page
when returning query results; many NLP approaches work with a
text summary rather than processing the large chunk of textual
data; and Facebook could build use cases to automatically
summarize user posts (ensuring anonymity) to target the
right ads.


6.12.3 TF-IDF 


Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting of word frequencies,
which is key to transforming the bag of words into a numeric matrix, thus allowing
many ML algorithms to be applied to text.


a. Term frequency $tf_{i,j}$ counts the number of occurrences $n_{i,j}$ of
a term $t_i$ in a document $d_j$. In the case of normalization, the
term frequency $tf_{i,j}$ is divided by $\sum_k n_{k,j}$.


b. Inverse document frequency $idf_i$ for a term $t_i$ is defined as

$$idf_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$

where $|D|$ denotes the total number of documents and
$|\{d \mid t_i \in d\}|$ is the number of documents where the term $t_i$
appears.
Intuitively, idf has two properties:


• Terms that occur in too many documents have little power in
determining the relevance of a document. idf weighs down such
frequently occurring words.


• Terms that occur in only a few documents have more
relevance. idf weighs up such less frequently occurring words.

For example, in a collection of documents related to sports, the
word "game" might be too frequent to be informative; however, any article
with the word "cricket" might show a high relevance to classify the
article into a particular game.


c. Term frequency/inverse document frequency (TF-IDF) is
the product $tf_{i,j} \cdot idf_i$.
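
To make definitions a through c concrete, here is a small worked example in base R. The three tiny documents are hypothetical and the computation is unnormalized, with the base-2 logarithm assumed for idf.

```r
# Three hypothetical sports documents, already tokenized
docs <- list(
  d1 = c("game", "cricket", "bat"),
  d2 = c("game", "football", "goal"),
  d3 = c("game", "game", "tennis")
)
vocab <- sort(unique(unlist(docs)))

# tf[i, j]: number of occurrences of term i in document j
tf <- vapply(docs,
             function(d) as.numeric(table(factor(d, levels = vocab))),
             numeric(length(vocab)))
rownames(tf) <- vocab

# idf[i] = log2(|D| / number of documents that contain term i)
idf <- log2(length(docs) / rowSums(tf > 0))

# tf-idf: each row of tf is multiplied by that term's idf
tfidf <- tf * idf
print(round(tfidf, 3))
```

The word "game" appears in every document, so its idf is log2(3/3) = 0 and its tf-idf weight vanishes, while "cricket" appears in only one document and receives the weight log2(3).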


Let’s create a tf-idf matrix from the bag-of-word approach in text mining. A tf-idf 
matrix is a numerical representation of a collection of documents (represented by row) 
and words contained in it (represented by columns). 


library(tm)

Warning: package 'tm' was built under R version 3.2.3

Loading required package: NLP

fine_food_data_corpus <- VCorpus(VectorSource(fine_food_data_selected$Text))


#Standardize the text - Pre-Processing
fine_food_data_text_dtm <- DocumentTermMatrix(fine_food_data_corpus, control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE
))


# save frequently-appearing terms (at least 500 times) to a character vector
fine_food_data_text_freq <- findFreqTerms(fine_food_data_text_dtm, 500)


# create DTMs with only the frequent terms
fine_food_data_text_dtm <- fine_food_data_text_dtm[, fine_food_data_text_freq]


tm::inspect(fine_food_data_text_dtm[1:5, 1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 8/42
Sparsity           : 84%
Maximal term length: 6
Weighting          : term frequency (tf)

    Terms
Docs also bag buy can coffee dog eat find flavor food
   1    1   0   0   0      0   0   0    0      0    0
   2    0   0   1   2      0   0   0    0      0    0
   3    0   0   0   0      2   0   0    0      0    0
   4    0   0   0   0      0   0   1    1      0    0
   5    0   0   0   0      0   0   0    1      2    0

#Create a tf-idf matrix
fine_food_data_tfidf <- weightTfIdf(fine_food_data_text_dtm, normalize = FALSE)


tm::inspect(fine_food_data_tfidf[1:5, 1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 8/42
Sparsity           : 84%
Maximal term length: 6
Weighting          : term frequency - inverse document frequency (tf-idf)

    Terms
Docs    also bag      buy      can  coffee dog      eat     find   flavor
   1 3.04583   0 0.000000 0.000000 0.00000   0 0.000000 0.000000 0.000000
   2 0.00000   0 2.635882 4.525741 0.00000   0 0.000000 0.000000 0.000000
   3 0.00000   0 0.000000 0.000000 5.82035   0 0.000000 0.000000 0.000000
   4 0.00000   0 0.000000 0.000000 0.00000   0 2.960361 2.992637 0.000000
   5 0.00000   0 0.000000 0.000000 0.00000   0 0.000000 2.992637 4.024711
    Terms
Docs food
   1    0
   2    0
   3    0
   4    0
   5    0
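
The summary lines printed by inspect() can be checked by hand. The sketch below re-enters the 5 x 10 count matrix from the first inspect() output and recomputes the non-sparse/sparse counts and the sparsity; it also recovers the idf of "coffee" from the printed tf-idf value, relying on the fact that with normalize = FALSE the weight is simply tf times idf. The object names are illustrative.

```r
# Term counts copied from the tf output shown earlier (5 documents x 10 terms)
terms <- c("also", "bag", "buy", "can", "coffee", "dog", "eat", "find", "flavor", "food")
counts <- matrix(0, nrow = 5, ncol = 10,
                 dimnames = list(as.character(1:5), terms))
counts["1", "also"] <- 1
counts["2", c("buy", "can")] <- c(1, 2)
counts["3", "coffee"] <- 2
counts["4", c("eat", "find")] <- c(1, 1)
counts["5", c("find", "flavor")] <- c(1, 2)

non_sparse <- sum(counts > 0)        # entries that are non-zero
sparse <- sum(counts == 0)           # entries that are zero
sparsity <- sparse / length(counts)  # share of zero entries

# Without normalization, tf-idf = tf * idf, so idf = tf-idf / tf
idf_coffee <- 5.82035 / counts["3", "coffee"]

c(non_sparse = non_sparse, sparse = sparse, sparsity = sparsity, idf_coffee = idf_coffee)
```

The "find" column gives a quick consistency check: it has a count of 1 in documents 4 and 5 and the same tf-idf value in both, as expected for a weight that depends only on the term.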


6.12.4 Part-of-Speech (POS) Tagging 


Parts of speech are useful features for finding named entities like people or
organizations in a text, and for other information extraction tasks. They can help in
classifying named entities in text into categories like persons, companies, locations,
expressions of time, and so on. Such tagging is used in many applications in molecular
biology, bioinformatics, and the medical community.

We will use the Amazon food review dataset to extract POS tags using R. Figure 6-71
shows the mapping from the tag abbreviations produced by the R script to the
parts of speech (POS) in the English language.


Tag    Description
FW     Foreign word
IN     Preposition or subordinating conjunction
JJS    Adjective, superlative
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNPS   Proper noun, plural
POS    Possessive ending
RP     Particle
VB     Verb, base form
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
WDT    Wh-determiner
WP$    Possessive wh-pronoun


Figure 6-71. Part-of-speech mapping





a. Pre-processing 


library("NLP")
library(tm)


fine_food_data_corpus <- Corpus(VectorSource(fine_food_data_selected$Text[1:3]))
fine_food_data_cleaned <- tm_map(fine_food_data_corpus, PlainTextDocument)


#tolower
fine_food_data_cleaned <- tm_map(fine_food_data_cleaned, tolower)
fine_food_data_cleaned[[1]]

[1] "twizzlers, strawberry my childhood favorite candy, made in lancaster
pennsylvania by y & s candies, inc. one of the oldest confectionery firms
in the united states, now a subsidiary of the hershey company, the company
was established in 1845 as young and smylie, they also make apple licorice
twists, green color and blue raspberry licorice twists, i like them all<br
/><br />i keep it in a dry cool place because is not recommended it to put
it in the fridge. according to the guinness book of records, the longest
licorice twist ever made measured 1.200 feet (370 m) and weighted 100 pounds
(45 kg) and was made by y & s candies, inc. this record-breaking twist
became a guinness world record on july 19, 1998. this product is kosher!
thank you"
fine_food_data_cleaned <- tm_map(fine_food_data_cleaned, removeWords,
    stopwords("english"))
fine_food_data_cleaned[[1]]

[1] "twizzlers, strawberry childhood favorite candy, made lancaster 
pennsylvania y & s candies, inc. one oldest confectionery firms united 
states, now subsidiary hershey company, company established 1845 
young smylie, also make apple licorice twists, green color blue raspberry 
licorice twists, like <br /><br /> keep dry cool place recommended 
put fridge. according guinness book records, longest licorice twist 
ever made measured 1.200 feet (370 m) weighted 100 pounds (45 kg) made y 
& s candies, inc. record-breaking twist became guinness world record july 
19, 1998. product kosher! thank " 
fine_food_data_cleaned <- tm_map(fine_food_data_cleaned, removePunctuation)
fine_food_data_cleaned[[1]]

[1] "twizzlers strawberry childhood favorite candy made lancaster
pennsylvania y s candies inc one oldest confectionery firms united
states now subsidiary hershey company company established 1845 young
smylie also make apple licorice twists green color blue raspberry licorice 
twists like br br keep dry cool place recommended put fridge 
according guinness book records longest licorice twist ever made 
measured 1200 feet 370 m weighted 100 pounds 45 kg made y s candies inc 
recordbreaking twist became guinness world record july 19 1998 product 
kosher thank " 
fine_food_data_cleaned <- tm_map(fine_food_data_cleaned, removeNumbers)
fine_food_data_cleaned[[1]]

[1] "twizzlers strawberry childhood favorite candy made lancaster 
pennsylvania y s candies inc one oldest confectionery firms united 
states now subsidiary hershey company company established young 
smylie also make apple licorice twists green color blue raspberry 
licorice twists like br br keep dry cool place recommended put 
fridge according guinness book records longest licorice twist ever 
made measured feet m weighted pounds kg made y s candies inc 
recordbreaking twist became guinness world record july product kosher 
thank " 
fine_food_data_cleaned <- tm_map(fine_food_data_cleaned, stripWhitespace)
fine_food_data_cleaned[[1]]

[1] "twizzlers strawberry childhood favorite candy made lancaster
pennsylvania y s candies inc one oldest confectionery firms united states
now subsidiary hershey company company established young smylie also make
apple licorice twists green color blue raspberry licorice twists like br br
keep dry cool place recommended put fridge according guinness book records
longest licorice twist ever made measured feet m weighted pounds kg made y s
candies inc recordbreaking twist became guinness world record july product
kosher thank "


b. PoS extraction


library(openNLP)
Warning: package 'openNLP' was built under R version 3.2.3
library(NLP)


fine_food_data_string <- NLP::as.String(fine_food_data_cleaned[[1]])


sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
fine_food_data_string_an <- annotate(fine_food_data_string,
    list(sent_token_annotator, word_token_annotator))


pos_tag_annotator <- Maxent_POS_Tag_Annotator()
fine_food_data_string_an2 <- annotate(fine_food_data_string, pos_tag_annotator,
    fine_food_data_string_an)


# Variant with POS tag probabilities as (additional) features.
head(annotate(fine_food_data_string, Maxent_POS_Tag_Annotator(probs = TRUE),
    fine_food_data_string_an2))

 id type     start end features
  1 sentence     1 524 constituents=<<integer,77>>
  2 word         1   9 POS=NNS, POS=NNS, POS_prob=0.7822268
  3 word        11  20 POS=VBP, POS=VBP, POS_prob=0.3488425
  4 word        22  30 POS=NN, POS=NN, POS_prob=0.8055908
  5 word        32  39 POS=JJ, POS=JJ, POS_prob=0.6114238
  6 word        41  45 POS=NN, POS=NN, POS_prob=0.9833723


# Determine the distribution of POS tags for word tokens.
fine_food_data_string_an2w <- subset(fine_food_data_string_an2, type == "word")
tags <- sapply(fine_food_data_string_an2w$features, `[[`, "POS")
table(tags)
tags 
» CC CD IN JJ JJS NN NNS RB VB VBD VBG VBN VBP VBZ 
1 2 1 1 10 2 2 9 5 1 6 2 4 2 3 
plot(table(tags), type ="h", xlab="Part-Of Speech", ylab ="Frequency") 
Figure 6-72. Part of speech frequency


# Extract token/POS pairs (all of them)
head(sprintf("%s/%s", fine_food_data_string[fine_food_data_string_an2w], tags), 15)

 [1] "twizzlers/NNS"    "strawberry/VBP"   "childhood/NN"
 [4] "favorite/JJ"      "candy/NN"         "made/VBD"
 [7] "lancaster/NN"     "pennsylvania/NN"  "y/RB"
[10] "s/VBZ"            "candies/NNS"      "inc/CC"
[13] "one/CD"           "oldest/JJS"       "confectionery/NN"


Noun (NN) seems to be the most frequently used part of speech, followed by adjectives
(JJ), in this data. This makes intuitive sense, since in review-related data people
talk about restaurants and food and about their characteristics, like "good," "bad,"
"awesome," and so on. Such POS identification could help in understanding the reviews
without reading the entire textual information.
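
Once token/POS pairs are available as strings, they are easy to tabulate or filter in base R. The sketch below reuses the 15 pairs printed above; the object names are illustrative, and only these 15 tokens are counted, not the full review.

```r
# The first 15 token/POS pairs shown above
pairs <- c("twizzlers/NNS", "strawberry/VBP", "childhood/NN", "favorite/JJ",
           "candy/NN", "made/VBD", "lancaster/NN", "pennsylvania/NN", "y/RB",
           "s/VBZ", "candies/NNS", "inc/CC", "one/CD", "oldest/JJS",
           "confectionery/NN")

# Split each pair into its token and its tag
parts <- strsplit(pairs, "/", fixed = TRUE)
tagged <- data.frame(
  token = vapply(parts, `[`, character(1), 1),
  tag   = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Frequency of each tag, most common first
tag_counts <- sort(table(tagged$tag), decreasing = TRUE)
print(tag_counts)

# Keep only the nouns (singular and plural)
nouns <- tagged$token[tagged$tag %in% c("NN", "NNS")]
print(nouns)
```

Filtering on the tag column in this way is a simple route to pulling out only the product-related nouns or only the descriptive adjectives from a review.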


6.12.5 Word Cloud 


The word cloud helps in visualizing the words most frequently being used in the reviews: 


library(SnowballC)
library(wordcloud)
library(slam)   # provides rollup()


fine_food_data_corpus <- VCorpus(VectorSource(fine_food_data_selected$Text))
fine_food_data_text_tdm <- TermDocumentMatrix(fine_food_data_corpus, control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE
))


# Sum the term counts over all documents
wc_tdm <- rollup(fine_food_data_text_tdm, 2, na.rm = TRUE, FUN = sum)


matrix_c <- as.matrix(wc_tdm)
wc_freq <- sort(rowSums(matrix_c))
wc_tmdata <- data.frame(words = names(wc_freq), wc_freq)


wc_tmdata <- na.omit(wc_tmdata)


# Plot the 100 most frequent terms
wordcloud(tail(wc_tmdata$words, 100), tail(wc_tmdata$wc_freq, 100),
    random.order = FALSE, colors = brewer.pal(8, "Dark2"))


Figure 6-73. Word cloud using Amazon Food Review dataset


WordCloud is a simple exploratory tool to understand the general trend in the word
usage, which could further help in building intuitions and insights.
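
The numbers behind a word cloud are just sorted word frequencies. As a minimal sketch of that counting step in base R, the example below uses three made-up reviews and a tiny illustrative stop-word list, neither of which comes from the dataset, and it skips stemming.

```r
# Hypothetical reviews and a tiny illustrative stop-word list
reviews <- c("Great coffee, great flavor!",
             "The coffee tastes like a treat.",
             "I love this flavor of tea.")
stop_words <- c("the", "a", "of", "i", "this", "like")

# Lowercase, split on anything that is not a letter, drop empties and stop words
words <- unlist(strsplit(tolower(reviews), "[^a-z]+"))
words <- words[nzchar(words) & !(words %in% stop_words)]

# Word frequencies, most frequent first
freq <- sort(table(words), decreasing = TRUE)
head(freq, 3)
```

A word cloud then simply maps each frequency to a font size, so the words at the top of this table would be drawn largest.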


6.12.6 Text Analysis: Microsoft Cognitive Services 


In this section, we will introduce you to the powerful world of text analytics by using
a third-party API called from within R. We will be using the Microsoft Cognitive
Services API to show some real-time analysis of text from the Twitter feed of a news
agency.

Note Microsoft Cognitive Services is chosen to show some real-world examples of
text analytics. We do not endorse any third-party tools or services.


Microsoft Cognitive Services is a machine intelligence service from Microsoft. It
was previously known as Project Oxford. This service provides cloud-based APIs for
developers to perform many high-end functions like face recognition, speech recognition,
text mining, video feed analysis, and many others. We will be using its free developer
service to show some text analytics features, which include the following:


• Sentiment analysis: What is the sentiment of a tweet? Is it positive,
negative, or neutral?


• Topic detection: What is the topic of discussion in a document?


• Language detection: Given some written text, can the service
tell which language it is in?


• Summarization: Can we automatically summarize a big
document to make it manageable to read?


We will be using Twitter feeds for sentiment analysis and topic detection, some
random text for language detection, and an article for summarization.

To start with this example, we need to set up an account with Microsoft Cognitive
Services and get an API key to work with their REST API. The key can be obtained by
registering at https://www.microsoft.com/cognitive-services/.

You will also need a Twitter developer account to set up an application in R to extract
tweets. You can get a Twitter API key by registering at https://apps.twitter.com/.

First we will set up the twitteR package by using the API key we got from Twitter
apps. The twitteR package provides an interface to the Twitter web API.


library("stringr")
library("dplyr")


library("twitteR")
#getTwitterOAuth(consumer_key, consumer_secret)
consumerKey <- "INSERT KEY"
consumerSecret <- "INSERT SECRET CODE"


#The two tokens below are needed when you want to pull tweets from your own account
accessToken <- "INSERT ACCESS TOKEN"
accessTokenSecret <- "INSERT SECRET CODE"

setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessTokenSecret)

[1] "Using direct authentication"
kIgnoreTweet <- "update: |nobot:"


GetTweets <- function(handle, n = 1000) {

  # Pull the user's timeline and keep the text and creation time of each tweet
  timeline <- userTimeline(handle, n = n)
  tweets <- sapply(timeline, function(x) {
    c(x$getText(), x$getCreated())
  })

  tweets <- data.frame(t(tweets))
  names(tweets) <- c("text.orig", "created.orig")

  tweets$text <- tolower(tweets$text.orig)
  tweets$created <- as.POSIXct(as.numeric(as.vector(tweets$created.orig)),
      origin = "1970-01-01")

  arrange(tweets, created)
}


handle <- "@TimesNow"
tweets <- GetTweets(handle, 100)


#Store the tweets as used in the book for future reproducibility
write.csv(tweets, "Dataset/Twitter Feed From TimesNow.csv", row.names = FALSE)
tweets[1:5, ]


text.orig

1 Procedures for this are at DGMO level which have been activated:
Def Min Parrikar on soldier who inadvertently cros<U+0085> https://t.co/dUx77VDXGj

4 IN PICS: Union Minister Venkaiah Naidu
a tribute to Mahatma Gandhi #GandhiJayanti https://t.co/7gbSV4hHTN

IN PICS: Union Minister Venkaiah Naidu flags
off the 'Swachhta Rally' from India Gate, Delhi https://t.co/XOw0xJRoSG
created.orig


1 1475379487 
2 1475380198 
3 1475380803 
4 1475380922 
5 1475381398 
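
The created.orig values are Unix epoch seconds, which is why GetTweets converts them with as.POSIXct and origin = "1970-01-01". As a small sketch of that conversion on the five values above, the example below fixes the time zone to UTC; that is an assumption made here for reproducibility, since the original call uses the session's local time zone.

```r
# Epoch seconds copied from the created.orig column above
created_orig <- c(1475379487, 1475380198, 1475380803, 1475380922, 1475381398)

# Convert to date-times; tz = "UTC" is an assumption for this example
created <- as.POSIXct(created_orig, origin = "1970-01-01", tz = "UTC")
format(created, "%Y-%m-%d %H:%M:%S")

# Seconds elapsed between consecutive tweets
gaps <- diff(created_orig)
gaps
```

All five tweets fall on 2 October 2016 (UTC), which matches the #GandhiJayanti hashtag in the tweet text.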


Now we have set up our Twitter account to pull feeds into our system. Similarly,
let's set up a Microsoft Cognitive Services account. The package used for calling
Microsoft services is mscstexta4r, the R client for the Microsoft Cognitive Services
Text Analytics REST API, which includes Sentiment Analysis, Topic Detection, Language
Detection, and Key Phrase Extraction. An account must be registered at the Microsoft
Cognitive Services web site https://www.microsoft.com/cognitive-services/ in order to
obtain a (free) API key. Without an API key, this package will not work properly.


#install.packages("mscstexta4r")
library(mscstexta4r)

Warning: package 'mscstexta4r' was built under R version 3.2.5

#Put the authentication API keys you got from Microsoft
Sys.setenv(MSCS_TEXTANALYTICS_URL = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/")
Sys.setenv(MSCS_TEXTANALYTICS_KEY = "YOUR KEY")


#Initialize the service
textaInit()


One more input we need is a news article to demonstrate summarization. We are using
this article: http://www.yourarticlelibrary.com/essay/essay-on-india-after-independence/41354/.


# Load Packages
require(tm)
require(NLP)
require(openNLP)

#Read the article into R environment
y <- paste(scan("Dataset/india after independence.txt", what = "character",
                sep = "_"), collapse = " ")

convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent
  # sentence detector employing the default model for language 'en'.
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}

# Convert the text into sentences
article_text = convert_text_to_sentences(y, lang = "en")

Now that we have all the inputs ready, we will show the four major analytics items
listed previously on our sample data.


1. Sentiment Analysis 


Sentiment analysis will tell us what kind of emotions the 
tweets are carrying. The Microsoft API returns a value 
between 0 and 1, where 1 means highly positive sentiment 
while 0 means highly negative sentiment. 


document_lang <- rep("en", length(tweets$text))
tryCatch({

  # Perform sentiment analysis
  output_1 <- textaSentiment(
    documents = tweets$text,    # Input sentences or documents
    languages = document_lang
    # "en"(English, default)|"es"(Spanish)|"fr"(French)|"pt"(Portuguese)
  )

}, error = function(err) {

  # Print error
  geterrmessage()

})

merged <- output_1$results

#Order the tweets with sentiment score
ordered_tweets <- merged[order(merged$score), ]

#Top 5 negative tweets
ordered_tweets[1:5, ]


text
7   pakistan has been completely cornered: shrikant sharma https://t.co/ujdux8z3er
99  hillary clinton says wave of shootings show need to protect children (pti) https://t.co/hptjov8eja
6   southern california on heightened alert until tuesday following increased possibility of major earthquake:guv's office of emergency services
10  china yet again blocks india's bid at the un to ban jaish-e-mohammad chief masood azhar by putting a technical hold https://t.co/yzomd77htr
100 #update #baramulla terror attack- 1 bsf jawan martyred, 1 jawan injured: reports

score
7   0.1440058
99  0.1752440
6   0.1770731
10  0.1947241
100 0.2508526

#Top 6 positive tweets
ordered_tweets[95:100, ]

text
73 the artists<U+0092> practice,the curator<U+0092>s vision, the commerce of the auction house,the best of the indian art world on<U+0085> https://t.co/gbxzgzydzt
37 the artists<U+0092> practice,the curator<U+0092>s vision, the commerce of the auction house,the best of the indian art world on<U+0085> https://t.co/tqxo7ytmku
43 prime minister narendra modi extends new year greetings to jewish community around the world https://t.co/xzpoqq4npd
54 china provides pak terror shield, stalls masood azhar<U+0092>s entry to terror list. #chinatopakrescue\n\ntune in,join special broadcast on @timesnow
90 founder of sulabh international bindeshwar pathak presents a book 'mahatma gandhi's life in colour' to pm modi https://t.co/rizsqwt93r
9  2nd test, day 3: new zealand all out for 204 in 1st innings, india lead by 112 runs #indvsnz

score
73 0.9468260
37 0.9484612
43 0.9579207
54 0.9739059
90 0.9759967
9  0.9879231


The sentiment analyzer has worked really well on the latest
100 tweets from the @TimesNow handle. You can do multiple
things with this same application, for instance measure
how many positive and negative news items ran on the
leading news channel. This can give you a glimpse of the general
sentiment in the country.
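
Counting positive and negative news items can be sketched as follows. The scores are the eleven values printed in the two outputs above (in practice you would use ordered_tweets$score for all 100 tweets). The cut-offs of 0.4 and 0.6 are our own hypothetical choice and are not part of the Microsoft API.

```r
# Scores copied from the two outputs above (5 lowest and 6 highest of 100 tweets)
scores <- c(0.1440058, 0.1752440, 0.1770731, 0.1947241, 0.2508526,
            0.9468260, 0.9484612, 0.9579207, 0.9739059, 0.9759967, 0.9879231)

# Hypothetical cut-offs: at most 0.4 is negative, above 0.6 is positive
sentiment <- cut(scores,
                 breaks = c(0, 0.4, 0.6, 1),
                 labels = c("negative", "neutral", "positive"),
                 include.lowest = TRUE)

# Number of tweets in each sentiment bucket
sentiment_counts <- table(sentiment)
print(sentiment_counts)
```

Tracking these counts over time, rather than for a single pull of tweets, would give a rough trend of the sentiment carried by the channel.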


2. Topic detection 


For topic detection, let's see what the official @CNN Twitter
handle talked about in its recent tweets. The topic
detection algorithm reads the tweets as if they were
a conversation and brings out the topics discussed in those
transcripts (or tweets). The service needs at least 100 documents,
so we pull 150 tweets.

handle <- "@CNN"
topic_text <- GetTweets(handle, 150)
write.csv(topic_text, "Dataset/Twitter Feed from CNN.csv", row.names = FALSE)

tryCatch({

  # Detect top topics in group of documents
  output_2 <- textaDetectTopics(
    topic_text$text,              # At least 100 documents (English only)
    stopWords = NULL,             # Stop word list (optional)
    topicsToExclude = NULL,       # Topics to exclude (optional)
    minDocumentsPerWord = NULL,   # Threshold to exclude rare topics (optional)
    maxDocumentsPerWord = NULL,   # Threshold to exclude ubiquitous topics (optional)
    resultsPollInterval = 30L,    # Poll interval (in s, default: 30s, use 0L for async)
    resultsTimeout = 1200L,       # Give up timeout (in s, default: 1200s = 20mn)
    verbose = FALSE               # If set to TRUE, print every poll status to stdout
  )

}, error = function(err) {

  # Print error
  geterrmessage()

})
output_2

textatopics [https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/topics?]

status: Succeeded 

operationId: 726edfccabdd4acb87a90716d7165343 

operationType: topics 

topics (first 20): 


keyPhrase score 
clinton o 7 
trump 15 
donald trump 10 
water 8 
rudy giuliani 8
hillary clinton 7
president 5 
trump tax 4 
reporter 4 

water monitor lizards 4 
famous parks 4 
beer corpse 4 
iconic talking bear 4 
teddy ruxpin 4 
daymond john 3 

police officer 3 

president obama 3 

bernie sanders 3 

defend trump 3 
tax 3 


The topic detection on the list of tweets tells us that the CNN news
channel was talking about U.S. presidential candidates
Donald Trump and Hillary Clinton. It also talks about school
and students.


3. Language detection 


Digital content nowadays is created in multiple
languages. To broaden the scope of text mining, we need to
automatically identify written languages and make collective
sense of them. Language detection methods help
us identify and translate languages. Here, I am
creating five messages in five different languages using Google
Translate. You can create your own examples.

#1-ARABIC, 2-PORTUGUESE, 3-ENGLISH, 4-CHINESE AND 5-HINDI
# The Arabic, Chinese, and Hindi strings appear in Figure 6-74;
# substitute them for the placeholders below
lang_detect <- c("<Arabic sentence>", "Eu sou um cientista de dados",
                 "I am a data scientist", "<Chinese sentence>", "<Hindi sentence>")

Figure 6-74. Language detection input


tryCatch({

  # Detect languages used in documents
  output_3 <- textaDetectLanguages(
    lang_detect,                      # Input sentences or documents
    numberOfLanguagesToDetect = 1L    # Default: 1L
  )

}, error = function(err) {

  # Print error
  geterrmessage()

})

output_3

texta [https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/languages?numberOfLanguagesToDetect=1]

text                           name                iso6391Name  score
<Arabic sentence>              Arabic              ar           1
Eu sou um cientista de dados   Portuguese          pt           1
I am a data scientist          English             en           1
<Chinese sentence>             Chinese_Simplified  zh_chs       1
<Hindi sentence>               Hindi               hi           1

Figure 6-75. Language detection output

Microsoft has detected all the languages correctly. This service is very
powerful when content about the same topic is created in different languages
and needs to be brought onto the same platform.


4. Summarization 


For summarization, we will use the article we loaded from the web site. The
algorithm tries to contextually mine the document sentence by sentence and then
creates an ordered list of sentences from the document that summarizes it.


article_lang <- rep("en", length(article_text))
tryCatch({

  # Get key talking points in documents
  output_4 <- textaKeyPhrases(
    documents = article_text,    # Input sentences or documents
    languages = article_lang
    # "en"(English, default)|"de"(German)|"es"(Spanish)|"fr"(French)|"ja"(Japanese)
  )

}, error = function(err) {

  # Print error
  geterrmessage()

})

#Print the top 5 summary
output_4$results[1:5, 1]

[1] "While some have a high opinion of India's growth story since its
independence, some others think the country's performance in the six
decades has been abysmal."

[2] "It's arguably true that the Five-Year Plans did target specific
sectors in order to quicken the pace of development, yet the outcome
hasn't been on expected lines."

[3] "And, the country is taking its own sweet time to catch up with the
developed world."

[4] "All efforts are frustrated by lopsided strategies and inept
implementation of policies."

[5] "India is the world's largest democracy."


The summarization indicates that the article talks about India and its path toward
development. It also emphasizes democracy in India.

In this section, we saw how powerful text analytics is for monitoring human
behavior. We learned the basics in R and learned to use powerful APIs. You can explore
more in the field of NLP.

6.12.7 Conclusion

We saw an opportunity to convert a poorly structured set of character streams and batches
of data into a meaningful set of information using text mining-based preprocessing
and NLP algorithm-based model building. Text mining is most appropriately
placed under Natural Language Processing (NLP), which itself is considered a subfield of
machine learning, and the algorithms used for tasks such as text summarization and
part-of-speech tagging rely heavily on statistical techniques.

We now move to the final topic of the chapter, where we will discuss the most
contemporary ideas for making machine learning algorithms more suitable for
streams of data; in other words, algorithms that can learn from a continuous stream of
data as it comes into the system instead of using a batch of training data.


6.13 Online Machine Learning Algorithms 


In many practical machine learning models, adapting to the changing data in the real world 
is a critical requirement. There are two possibilities for tackling such changing needs: 


• Manually update the model periodically (maybe once a week, month, or
  year), depending on how fast and how much the business where the
  model is deployed changes. Consider medical diagnostics for
  cancer prediction. As you would expect, the type of cancer does not
  evolve very quickly with time, so such a model could remain in use
  for a long time even without updates. However, when
  new data from a cancer patient comes in, it's possible to manually
  update the model and deploy it back into the system.

• Update the model in real time as the data flows into the system.
  For example, if Google completely moved to a machine learning
  model-based search engine rather than the currently used heuristic
  algorithm, the model might adapt on the go to the search queries coming
  from users. Figure 6-76 shows the process of online updates as
  a new data stream arrives in the system.

Figure 6-76. Online machine learning algorithms (Source: http://www.doyensahoo.com/introduction.html)

Figure 6-76 shows how the predictor takes the continuous input data stream,
learns from it, and feeds updates back to the learning model.

There are many benefits and challenges that come with such online real-time-based 
learning methods, notably: 


• Efficient and space optimized: Since there is no need to pass a large
  amount of data as a batch to the learning model, we can train the
  model with one observation at a time and update the model. This
  brings faster model training and optimized storage. An observation
  can be discarded if it doesn't improve the model performance.

• Difficult to create a pipeline: Creating such an online learning
  data pipeline is a challenging task. On one hand, if the volume
  and velocity of data are high, training the model could become a
  bottleneck. On the other hand, if the flow into the model is throttled, a lot of
  storage is required to hold the waiting data.

• Model evaluation is hard: In batch processing, we
  had controlled training and testing datasets, and the testing data
  could be used to evaluate the model. With online data,
  that's not possible. At any given instant, we don't know if the model
  has seen enough different types of observations to truly
  perform as expected.
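
The first point, updating a model with one observation at a time, can be illustrated with the simplest possible online estimator, a running mean. This is only a sketch of the general idea and not the algorithm used later in this section. The function name online_mean is hypothetical, and the stream holds a few StoreArea values from the House Worth dataset used later in this section.

```r
# Online (streaming) estimate of a mean: one observation at a time
online_mean <- function(stream) {
  estimate <- 0
  n_seen <- 0
  for (x in stream) {
    n_seen <- n_seen + 1
    # Move the estimate toward the new observation; the step size shrinks as 1/n
    estimate <- estimate + (x - estimate) / n_seen
    # x is no longer needed here; only estimate and n_seen are kept in memory
  }
  estimate
}

# A small stream of StoreArea values
stream <- c(29.9, 44, 46.2, 46.2, 48.7, 56.4, 47.1, 56.7, 84, 49.2)

running <- online_mean(stream)   # learned one observation at a time
batch <- mean(stream)            # computed from the full batch
print(c(online = running, batch = batch))
```

The online estimate matches the batch mean while storing only two numbers, which is the storage benefit described above. Real online learners replace the mean with model parameters and the 1/n step with a learning rate.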


Even with such challenges, online machine learning is an emerging research area,
as more and more systems are becoming real-time consumers of data and speed of
adaptability is a top priority. We will use the House Worth dataset and apply the online
update method of Unsupervised Fuzzy Competitive Learning. Although a detailed
discussion of this topic is beyond the scope of this book, we will demonstrate with
an example how well this method works for clustering problems. This method works
by performing an update directly after each input signal (i.e., for each single observation).

6.13.1 Fuzzy C-Means Clustering

Fuzzy c-means is the fuzzy version of the well-known k-means clustering algorithm, and it
also has an online variant (Unsupervised Fuzzy Competitive Learning). We will use the
package e1071 in R, which implements the algorithm in a function named cmeans.

As the R documentation on the topic describes, the data given by x is clustered by
generalized versions of the fuzzy c-means algorithm, which use either a fixed-point or an
online heuristic for minimizing the objective function


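$$\sum_{i=1}^{n} \sum_{j=1}^{c} w_i \, u_{ij}^{m} \, d_{ij}$$

where

$w_i$ is the weight of observation i
$u_{ij}$ is the membership of observation i in cluster j
$d_{ij}$ is the distance between observation i and the center of cluster j
$m$ is the degree of fuzzification (a number greater than 1)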
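
To make the objective concrete, the following sketch evaluates it by hand in base R for four toy points and two fixed centers. The data, the centers, and the use of squared Euclidean distance are our assumptions for illustration. The memberships use the standard fuzzy c-means formula, in which $u_{ij}$ is proportional to $(1/d_{ij})^{1/(m-1)}$ and each row sums to 1. This is not the internal code of cmeans.

```r
# Toy data (hypothetical): four observations with two features each
x <- matrix(c(1.0, 1.2,
              0.8, 1.1,
              5.2, 4.9,
              4.7, 5.3), ncol = 2, byrow = TRUE)

# Two fixed cluster centers (hypothetical)
centers <- matrix(c(1, 1,
                    5, 5), ncol = 2, byrow = TRUE)

m <- 2                   # degree of fuzzification
w <- rep(1, nrow(x))     # equal observation weights

# d[i, j]: squared Euclidean distance from observation i to center j
d <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(x, 2, centers[j, ])^2)
})

# u[i, j]: fuzzy membership of observation i in cluster j
u <- (1 / d)^(1 / (m - 1))
u <- u / rowSums(u)

# Objective: sum over i and j of w_i * u_ij^m * d_ij
objective <- sum(w * u^m * d)

# Closest hard clustering: the cluster with the largest membership
hard_cluster <- max.col(u)
print(round(u, 3))
print(hard_cluster)
```

Points that sit close to a center receive a membership near 1 for that cluster, so they contribute little to the objective. Moving the centers away from the data would increase it.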


a. Data preparation 


library(ggplot2)

Warning: package 'ggplot2' was built under R version 3.2.5

library(e1071)

Warning: package 'e1071' was built under R version 3.2.5

Data_House_Worth <- read.csv("Dataset/House Worth Data.csv", header = TRUE)

str(Data_House_Worth)

'data.frame': 316 obs. of 5 variables:
 $ HousePrice   : int 138800 155000 152000 160000 226000 275000 215000 392000 325000 151000 ...
 $ StoreArea    : num 29.9 44 46.2 46.2 48.7 56.4 47.1 56.7 84 49.2 ...
 $ BasementArea : int 75 504 493 510 445 1148 380 945 1572 506 ...
 $ LawnArea     : num 11.22 9.69 10.19 6.82 10.92 ...
 $ HouseNetWorth: Factor w/ 3 levels "High","Low","Medium": 2 3 3 3 3 1 3 1 1 3 ...

#Remove the extra column that is not required for the model
Data_House_Worth$BasementArea <- NULL


b. Fuzzy c-mean clustering 


Observe that we are passing the value ufcl to the parameter
method, which performs an online update of the model using
Unsupervised Fuzzy Competitive Learning (UFCL).

online_cmean <- cmeans(Data_House_Worth[, 2:3], 3, 20, verbose = TRUE,
                       method = "ufcl", m = 2)

Iteration: 1, Error: 465.1579393478
Iteration: 2, Error: 444.0414997086
Iteration: 3, Error: 424.6549206588
Iteration: 4, Error: 406.6721061449
Iteration: 5, Error: 389.8788008700
Iteration: 6, Error: 374.1842570779
Iteration: 7, Error: 359.5913592120
Iteration: 8, Error: 346.1483860876
Iteration: 9, Error: 333.9078002276
Iteration: 10, Error: 322.9024279730
Iteration: 11, Error: 313.1374056984
Iteration: 12, Error: 304.5921263137
Iteration: 13, Error: 297.2268898905
Iteration: 14, Error: 290.9907447391
Iteration: 15, Error: 285.8286344099
Iteration: 16, Error: 281.6870892396
Iteration: 17, Error: 278.5183573747
Iteration: 18, Error: 276.2831875877
Iteration: 19, Error: 274.9525794936
Iteration: 20, Error: 274.5088021136

print(online_cmean)
Fuzzy c-means clustering with 3 clusters 


Cluster centers: 
StoreArea LawnArea 
1 21.44992 9.584415 
2 43.59627 9.916090 
3 11.04677 11.214669 


Memberships: 

1 2 3 
[1,] 0.6250584893 2.446492e-01 1.302923e-01 
[2,] 0.0004209086 9.993824e-01 1.966837e-04 
[3,] 0.0110012372 9.835467e-01 5.452043e-03 
[4,] 0.0254099375 9.620333e-01 1.255677e-02 
[5,] 0.0344305970 9.474942e-01 1.807525e-02 


Closest hard clustering:
[cluster label (1, 2, or 3) for each of the 316 observations]

Available components:
[1] "centers"     "size"        "cluster"     "membership"  "iter"
[6] "withinerror" "call"


c. Visual evaluation of cluster accuracy 


The plot shows the overlap between the clusters formed by the online fuzzy
c-means algorithm and the classification variable we created
manually. The plot has a near perfect overlap, which indicates
good clustering.

ggplot(Data_House_Worth, aes(StoreArea, LawnArea, color = HouseNetWorth)) +
  geom_point(alpha = 0.4, size = 3.5) +
  geom_point(col = online_cmean$cluster) +
  scale_color_manual(values = c('black', 'red', 'green'))
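
A numeric complement to the visual check is a cross-tabulation of cluster labels against the manual classes. The sketch below uses short hypothetical vectors standing in for online_cmean$cluster and Data_House_Worth$HouseNetWorth, so the numbers are illustrative only. Purity, the share of observations that fall in the majority class of their cluster, is one simple summary we assume here; it is not a measure used by the book.

```r
# Hypothetical stand-ins for Data_House_Worth$HouseNetWorth and online_cmean$cluster
net_worth <- factor(c("Low", "Medium", "Medium", "Medium", "Medium",
                      "High", "Medium", "High", "High", "Medium"))
cluster <- c(3, 2, 2, 2, 2, 1, 2, 1, 1, 3)

# Rows are clusters, columns are the manually assigned classes
agreement <- table(Cluster = cluster, NetWorth = net_worth)
print(agreement)

# Purity: share of observations in the majority class of their cluster
purity <- sum(apply(agreement, 1, max)) / sum(agreement)
print(purity)
```

Cluster numbers are arbitrary, so a good result shows each row concentrated in a single column rather than on any particular diagonal.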


The plot in Figure 6-77 shows that the clusters substantially overlap with our prior
classification. This is fair evidence of the power of online machine learning.

Figure 6-77. Cluster plot with fuzzy C-means clustering

6.13.2 Conclusion


In today's fast-moving world, the time to decision is often more important than the quality
of the decision. This is driven partly by the competitive landscape and partly by the cost
of delay. Online machine learning tools and techniques are bound to rise in the machine
learning world in the coming days. Industry and researchers have to work together to create
elegant algorithms as well as hardware and software that can implement those algorithms
with a high volume and high velocity of data flow.


6.14 Model Building Checklist 


Before the chapter ends, we have compiled a checklist of questions that you need to
address before taking up any project in machine learning. Whenever it comes to choosing
an ML algorithm or deciding to use ML on a new problem, an assessment of the available
data is the most important part of the entire process. Ask this broad checklist of questions
before proceeding any further:

• What is it that you want to achieve in this problem? Is the goal to
  predict, estimate a value, find patterns, or just explore?

• What are the types of each variable in the dataset? Are they all
  numeric, categorical, or mixed?

• Have you identified the response (output) and predictor (input)
  variables?

• Are there many missing values and outliers in the data?

• How would you solve the problem if, let's say, ML algorithms
  were not to be used? Is it possible to explore the data using simple
  statistics and visualization to arrive at the answers to the problem
  without ML?

• Does the boxplot, histogram, or scatter plot show any interesting
  insights in the data?

• Did you find the standard deviation, quartiles, mean, and
  correlation measures for all numerical variables? Do they show
  anything interesting?

• How large is your dataset? Does your problem require the
  complete data to be used or is a small sample good enough?

• Are there enough computational resources (RAM, storage, and
  CPU) to run any ML algorithm?

• Do you think that the current data might soon become old and
  the ML model might require an update soon after it's built?

• Are there any plans to build a data product out of the final ML
  model?

This checklist might sound a little too big and random; however, if you figure out
the answers to these questions before you jump into building an ML model, you will
potentially save 40%-60% of your time.
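
Several of the data questions in the checklist can be answered with a few lines of base R. The sketch below uses the built-in mtcars dataset purely as a stand-in for your own data, and the choice of the mpg, wt, and hp columns is ours for illustration.

```r
# mtcars ships with R and is used here only as a stand-in dataset
data(mtcars)

# What are the types of each variable?
variable_types <- sapply(mtcars, class)

# Are there many missing values?
missing_counts <- colSums(is.na(mtcars))

# Mean, quartiles, and standard deviation of one numeric variable
mpg_summary <- summary(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

# Correlation measures between a few numeric variables
correlations <- cor(mtcars[, c("mpg", "wt", "hp")])

# Potential outliers, using the same rule a boxplot applies
mpg_outliers <- boxplot.stats(mtcars$mpg)$out

# How large is the dataset?
dataset_size <- dim(mtcars)

print(variable_types)
print(missing_counts)
print(mpg_summary)
print(round(correlations, 2))
```

If this kind of quick summary already answers the business question, an ML model may not be needed at all, which is the point of the fifth question in the checklist.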


6.15 Summary 


Machine learning is a vast field because of the applications it has found over the years
in many academic disciplines and industries. Years of advancement in tools and
technology have brought machine learning a step closer to even the naive user without much
statistical background. This has given rise to the practical applicability of the methods
found in machine learning and the development of many ML-centric products and designs.
These are exciting times to be in the field of machine learning, which offers endless
opportunities. Experts who are literate in machine learning are in high demand in many
industries. The time is not far away when machine learning will form the core of every
industry and product, where it's not just a matter of coding the software with a set of
logical statements but of infusing a learning algorithm that adapts to changing needs.

CHAPTER 7

Machine Learning Model Evaluation





Model evaluation is the most important step in developing any machine learning
solution. At this stage in model development, we measure the model performance and
decide whether to go ahead with the model or revisit all our previous steps as described
in the PEBE, our machine learning process flow, in Chapter 1. In many cases, we may
even discard the complete model based on the performance metrics. This phase of the
PEBE plays a very critical role in the success of any ML-based project.

The central idea of model evaluation is minimizing the error on test data, where
error can be defined in many ways. In the most intuitive sense, error is the difference between
the actual value of the dependent variable in the data and the value the ML model predicts.
The error metrics are not always universal, and some specific problems require creative
error metrics that suit the problem and the domain knowledge.

It is important to emphasize here that the error metric used to train the model might
be different from the evaluation error metric. For instance, for a classification model you
might have used the LogLoss error metric, but for evaluating the model, you might want
to see a classification rate using a confusion matrix.
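
The difference between a training metric and an evaluation metric can be sketched on a small binary example. The labels and predicted probabilities below are hypothetical, and the 0.5 cut-off used to turn probabilities into classes is an assumption. LogLoss is computed with its usual definition, the negative mean log-likelihood of the actual labels.

```r
# Hypothetical actual classes and predicted probabilities of class 1
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3)

# Training-style metric: LogLoss, which uses the probabilities directly
log_loss <- -mean(actual * log(prob) + (1 - actual) * log(1 - prob))

# Evaluation-style metric: classification rate from a confusion matrix
predicted <- as.integer(prob >= 0.5)      # assumed cut-off of 0.5
confusion <- table(Actual = actual, Predicted = predicted)
classification_rate <- sum(diag(confusion)) / sum(confusion)

print(log_loss)
print(confusion)
print(classification_rate)
```

LogLoss rewards confident, correct probabilities, while the classification rate only looks at which side of the cut-off each prediction falls on, so two models can rank differently under the two metrics.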

In this chapter, we will lay out the basic ideas behind evaluating a model and
discuss some of the methods in detail.

Learning objectives 


• Introduction to model performance and evaluation

• Population stability index

• Model evaluation for continuous output

• Model evaluation for discrete output

• Probabilistic techniques

• Illustration of advanced metrics like the Kappa Error Metric

7.1 Dataset

The datasets for this chapter are the same ones we introduced in Chapter 6 to explain the
machine learning techniques for regression-based methods and classification problems.
Let's do a quick recap of them and then jump into the concepts.


7.1.1 House Sale Prices 


We will be using the house sale prices dataset detailed in Chapter 6. Let’s have a quick 
look at the dataset. 


library(data.table)

Data_House_Price <- fread("Dataset/House Sale Price Dataset.csv", header = TRUE,
                          verbose = FALSE, showProgress = FALSE)

str(Data_House_Price)

Classes 'data.table' and 'data.frame': 1300 obs. of 14 variables:
 $ HOUSE_ID        : chr "0001" "0002" "0003" "0004" ...
 $ HousePrice      : int 163000 102000 265979 181900 252000 180000 115000 176000 192000 132500 ...
 $ StoreArea       : int 433 396 864 572 1043 440 336 486 430 264 ...
 $ BasementArea    : int 662 836 0 594 0 570 0 552 24 588 ...
 $ LawnArea        : int 9120 8877 11700 14585 10574 10335 21750 9900 3182 11560 ...
 $ StreetHouseFront: int 76 67 65 NA 85 78 100 NA 43 NA ...
 $ Location        : chr "RK Puram" "Jama Masjid" "Burari" "RK Puram" ...
 $ ConnectivityType: chr "Byway" "Byway" "Byway" "Byway" ...
 $ BuildingType    : chr "IndividualHouse" "IndividualHouse" "IndividualHouse" "IndividualHouse" ...
 $ ConstructionYear: int 1958 1951 1880 1960 2005 1968 1960 1968 2004 1962 ...
 $ EstateType      : chr "Other" "Other" "Other" "Other" ...
 $ SellingYear     : int 2008 2006 2009 2007 2009 2006 2009 2008 2010 2007 ...
 $ Rating          : int 6 4 7 6 8 5 5 7 8 5 ...
 $ SaleType        : chr "NewHouse" "NewHouse" "NewHouse" "NewHouse" ...
 - attr(*, ".internal.selfref")=<externalptr>


These are the variables and their types. It can be seen that the data is a mix of 
character and numeric data. 

The following code and Figure 7-1 present a summary of House Sale Price. This is 
our dependent variable in all the modeling examples we have built in this book. 


dim(Data_House_Price)
[1] 1300 14

Check the distribution of the dependent variable (HousePrice). We plot a
histogram to see how the house prices are spread in our dataset.

hist(Data_House_Price$HousePrice/1000000, breaks = 20, col = "blue",
     xlab = "House Sale Price(Million)",
     main = "Distribution of House Sale Price")

Figure 7-1. Distribution of house sale price


Here, we call the summary() function to see basic properties of the HousePrice data. The
output gives us the minimum, first quartile, median, mean, third quartile, and maximum.

#Also look at the summary of the Dependent Variable
summary(Data_House_Price$HousePrice)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34900  129800  163000  181500  214000  755000

These statistics give us some idea of how the house price is distributed in the dataset.
The average sale price is $181,500 and the highest sale price is $755,000.

#Pulling out relevant columns and assigning required fields in the dataset
Data_House_Price <- Data_House_Price[, .(HOUSE_ID, HousePrice, StoreArea, StreetHouseFront,
                                         BasementArea, LawnArea, Rating, SaleType)]

The following code snippet removes the missing values from the dataset. This is
important to make sure the data is consistent throughout.

#Omit any missing value
Data_House_Price <- na.omit(Data_House_Price)

Data_House_Price$HOUSE_ID <- as.character(Data_House_Price$HOUSE_ID)


7.1.2 Purchase Preference 


This data contains transaction history for customers who bought a particular product. For
each customer ID, multiple data points are simulated to capture the purchase behavior.
The data was originally set up for a multi-class problem with four possible products from the
insurance industry. Here, we show a summary of the purchase prediction data.


Data_Purchase <- fread("Dataset/Purchase Prediction Dataset.csv", header = TRUE,
                       verbose = FALSE, showProgress = FALSE)
str(Data_Purchase)

Classes 'data.table' and 'data.frame': 500000 obs. of 12 variables:
 $ CUSTOMER_ID         : chr "000001" "000002" "000003" "000004" ...
 $ ProductChoice       : int 2 3 2 3 2 3 2 2 2 3 ...
 $ MembershipPoints    : int 6 2 4 2 6 6 5 9 5 3 ...
 $ ModeOfPayment       : chr "MoneyWallet" "CreditCard" "MoneyWallet" "MoneyWallet" ...
 $ ResidentCity        : chr "Madurai" "Kolkata" "Vijayawada" "Meerut" ...
 $ PurchaseTenure      : int 441063313198...
 $ Channel             : chr "Online" "Online" "Online" "Online" ...
 $ IncomeClass         : chr ...
 $ CustomerPropensity  : chr "Medium" "VeryHigh" "Unknown" "Low" ...
 $ CustomerAge         : int 55 75 34 26 38 71 72 27 33 29 ...
 $ MartialStatus       : int 0 0 0 0 1 0 0 0 0 1 ...
 $ LastPurchaseDuration: int 4 1515661054156...
 - attr(*, ".internal.selfref")=<externalptr>


This output shows a mixed bag of variables in the purchase prediction data.
Look carefully at the dependent variable in this dataset, ProductChoice, which was
loaded as an integer. Before we use it for modeling, we have to make sure it is
converted into a factor.

Similar to the continuous dependent variable, we will create the dependent variable
for the discrete case from the purchase prediction data. For simplicity and ease of explanation,
we will work only with the product preference ProductChoice as the dependent variable,
with four levels (i.e., 1, 2, 3, and 4).
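
The integer-to-factor conversion mentioned above can be sketched as follows. The vector holds the first ten ProductChoice values from the str() output, and the variable names are hypothetical. On the full data, the same factor() call would be applied to the ProductChoice column.

```r
# First ten ProductChoice values from the str() output above
product_choice <- c(2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 3L)

# Declare all four possible products as levels, even those not yet observed
product_choice_factor <- factor(product_choice, levels = 1:4)

# Counts per level; unobserved products show up with a count of zero
choice_counts <- table(product_choice_factor)
print(levels(product_choice_factor))
print(choice_counts)
```

Fixing the levels explicitly keeps the four classes stable across training and test subsets, even when a subset happens to miss one of the products.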


dim(Data_Purchase)

[1] 500000 12

#Check the distribution of data before grouping
table(Data_Purchase$ProductChoice)

     1      2      3      4
106603 199286 143893  50218

The barplot below shows the distribution of ProductChoice. The highest
volume is for ProductChoice = 2, then 3, followed by 1 and 4.

barplot(table(Data_Purchase$ProductChoice), main = "Distribution of ProductChoice",
        xlab = "ProductChoice Options", col = "Blue")

Figure 7-2. Distribution of product choice options


In the following code, we subset the data to select only the columns we will be using
in this chapter. We also remove all missing values (NA) to keep the data consistent across
different options.

#Pulling out only the relevant data to this chapter
Data_Purchase <- Data_Purchase[, .(CUSTOMER_ID, ProductChoice, MembershipPoints,
                                   IncomeClass, CustomerPropensity, LastPurchaseDuration)]

#Delete NA from subset
Data_Purchase <- na.omit(Data_Purchase)
Data_Purchase$CUSTOMER_ID <- as.character(Data_Purchase$CUSTOMER_ID)

This subset of data will be used throughout this chapter to explain the various
concepts.

7.2 Introduction to Model Performance and Evaluation


Model performance and evaluation is carried out once you have developed the model 
and want to understand how the model performs on the test data/validation data. Before 
the start of model development, you usually divide the data into three categories: 


• Training data: This dataset is used to train the model/machine.
  At this stage, the focus of the machine learning algorithm is to
  optimize some well-defined metric reflecting the model fit. For
  instance, in Ordinary Least Squares, we use the training
  data to train a linear regression model by minimizing squared
  errors.

• Testing data: The test dataset contains data points that the ML
  algorithm has not seen before. We apply this dataset to see
  how the model performs on new data. Most of the model
  performance and evaluation metrics are calculated and compared against
  thresholds in this step. Here, the modeler can decide if the model
  needs any improvement and can make changes and tweaks
  accordingly.

• Validation data: In many cases, the modeler doesn't keep this
  dataset for various reasons (e.g., limited data, short time
  period, larger test set, etc.). In essence, this dataset's purpose is
  to check for overfitting of the model and provide insights into
  calibration needs. Once the modeler believes the ML model has
  done well on testing data and starts to use validation data, they
  can't go back and change the model. Instead, they have to try to
  calibrate the model and check for overfitting. If the model fails
  to meet the set standards, we are forced to drop the model and start the
  process again.


The proportions of these datasets are decided based on the problem and other
statistical constraints. In general, for sufficiently large data we may use a
60%:20%:20% ratio for our training, testing, and validation datasets.
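
A 60%:20%:20% split can be sketched in base R by shuffling row indices once and slicing them. The row count of 1,300 mirrors the house price data before missing values are removed. The seed value and the variable names are our assumptions.

```r
set.seed(42)                  # arbitrary seed so the split is reproducible
n <- 1300                     # number of rows to split

# Shuffle the row indices once, then slice them into three disjoint parts
shuffled <- sample(seq_len(n))
n_train <- floor(0.6 * n)
n_test  <- floor(0.2 * n)

train_idx <- shuffled[1:n_train]
test_idx  <- shuffled[(n_train + 1):(n_train + n_test)]
valid_idx <- shuffled[(n_train + n_test + 1):n]

print(c(train = length(train_idx), test = length(test_idx),
        validation = length(valid_idx)))
```

The index vectors can then be used to subset the data, for example Data_House_Price[train_idx]. Because every row index appears in exactly one part, no observation leaks from training into testing or validation.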

Model performance is measured using test data, and the modeler decides what
thresholds are acceptable to validate the model. Performance metrics are generally
built on the basic criterion of model fit, i.e., how different the model output is
from the actual value. This difference between actual and predicted values is the error
that should be minimized for good performance.

Within the scope of this book, we will discuss how to use some commonly
used performance and evaluation metrics on two types of model output (predicted)
variables:


• Continuous output: The model or series of models that gives a 
continuous predicted value against a continuous dependent 
variable in the model. For instance, house prices are continuous and, 
when predicted using a model, will give continuous 
predicted values. 

• Discrete output: The model or series of models that gives a discrete 
predicted value against a discrete dependent variable in the model. 
For instance, for a credit card application, the risk class of the 
borrower, when used in a predictive model for classification, will 
give a discrete predicted value (i.e., the predicted risk class). 


We can expand this list based on other complicated modeling techniques and how 
we want to evaluate them. For instance, think about a logistic model; the dependent 
variable is binomially distributed but the output is on the probability scale (0 to 1). 
Depending on the final purpose of the business, we have to decide what to 
evaluate and at what step of the process. For completeness, you can use 
concordant-discordant ratios to evaluate the model's power to separate 0s from 
1s. Concordant-discordant ratios are discussed in Chapter 6. The reader is encouraged to 
pursue the statistical underpinnings of model performance measurement concepts. 


7.3 Objectives of Model Performance 
Evaluation 


Business stakeholders play an important role in defining the performance metrics. The 
models have direct implications on costs for the business. The model that simply minimizes a 
complicated statistical measure might not always be the best model for a business. For illustration 
purposes, assume a credit risk model for credit scoring new applicants. A few of the input 
variables are internal and some are purchased from external sources. The model performs 
really well with external data from multiple parties, but that data comes at a cost. In that 
case, simply having a model with minimum classification error is not enough; the model 
output should also make economic sense to the business. 

In general, we can classify the purpose of model performance and evaluation focus 
into three buckets. These three are part of general framework for using statistical methods 
and their interpretation. 


• Accuracy: The accuracy of a model reflects the proportion of right 
predictions: in the continuous case, a minimal residual, and 
in the discrete case, the correct class prediction. A small residual in 
continuous cases or few incorrect classifications in discrete cases 
implies higher accuracy and a better model. 


• Gains: The gains statistic gives us an idea about the performance 
of the model itself. The method generalizes to different 
modeling techniques and is very intuitive. It compares the 
model output with the result that we get without using a model 
(or with a random model). So in essence, this will tell you how good 
the model is compared to a random model that has a random 
outcome. When comparing two models, the model having the 
higher gains statistic at a specified percentile is preferred. 

• Accreditation: The model accreditation reflects the credibility of 
a model for actual use. This approach ensures that the data on 
which the model is applied is similar to the training data. The population 
stability index is one of the measures used to ensure accreditation 
before using the model. It is a measure 
to ascertain if the model training dataset is similar to the data 
where the model is used, that is, whether the population is stable with respect 
to the features used in the model. The index is non-negative, with values 
close to 0 indicating greater similarity between the 
two datasets. A stable population supports the 
use of the model for prediction. 


These kinds of scenarios are abundant in actual practice. In this book, we will discuss 
the basic statistical methods used to evaluate model performance. We will also look 
at intuitive ways of thinking about model performance. Intuitive ways of thinking help 
create new error metrics and add business context while measuring model performance. 


7.4 Population Stability Index 


Population stability is seldom ignored by modelers while testing the model performance 
on various datasets. The idea here is to ensure that the testing dataset is drawn from the same 
population as the training dataset. If this is the case, the model performance tested on this data will give you 
insights into how well the model performed; otherwise, your model performance results 
are of no use. 

Consider an example. You developed a model for predicting the mean income of U.S. 
consumers using a dataset from 2000 to 2009. You developed the model by training it on 
the data from 2000 to 2007 and then kept the last two years for testing the model. What 
is going to happen with the test results? The trained model might be the best model, 
but the model performance in the test results is still bad. Why? Because the population 
characteristics between train and test have changed. The U.S. economy went through 
a severe recession between Q4 2007 and Q4 2008. In statistical terms, the underlying 
population is not stable between the two periods. 

Population Stability is very important in time series data to keep following the 
underlying changes in the population to make sure that the model stays relevant. The 
financial service industry has been using this metric for a long time to make sure the 
financial models are relevant to the market. 

Let’s illustrate the concept of population stability for a continuous distribution. We 
will divide the population data into two portions, say set 1 and set 2. In machine learning 
performance testing, think about set 1 as the train data and set 2 as the test data. 


Note The concept of population stability is very important when the underlying 
relationship between the dependent and independent variables is affected by external 
unseen factors. 

#Create set 1 and set 2 : First 2/3 as set 1 and remaining 1/3 as set 2 
summary(Data_House_Price$HousePrice) 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34900  127500  159000  181300  213200  755000 

set_1 <- Data_House_Price[1:floor(nrow(Data_House_Price)*(2/3)), ]$HousePrice 
summary(set_1) 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34900  128800  160000  180800  208900  755000 

set_2 <- Data_House_Price[floor(nrow(Data_House_Price)*(2/3) +1):nrow(Data_House_Price), ]$HousePrice 
summary(set_2) 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  52500  127000  155000  182200  221000  745000 


For the continuous case, we can check for stability using the two-sample Kolmogorov-Smirnov 
test (KS test). The KS test is a non-parametric test for comparing the cumulative 
distributions of two samples. 

The empirical distribution function F_n for n iid observations X_i is defined as: 

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{[-\infty, x]}(X_i)$$ 

where I_{[-\infty, x]}(X_i) is the indicator function, equal to 1 if X_i <= x and equal to 0 otherwise. 

The Kolmogorov-Smirnov statistic for a given cumulative distribution function F(x) is 

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$ 

where sup_x is the supremum (in effect, the maximum) of the set of distances. In the 
two-sample version used here, F(x) is replaced by the empirical distribution function 
of the second sample. 

Essentially, the KS statistic will get the highest point of difference between the 
empirical distribution comparison of two samples and, if that is too high, we say the two 
samples are different. In terms of population stability, it says your model performance 
can’t be measured on new samples and the underlying sample is not from the same 
distribution on which the model was trained. 
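
To make the definition concrete, the following sketch computes the two-sample KS statistic by hand and compares it with the value returned by ks.test() from base R. The data here are simulated (sample_1 and sample_2 are made-up normal samples, not the house price data), so the numbers only illustrate the mechanics.

```r
set.seed(1)
sample_1 <- rnorm(200, mean = 0)     # stand-in for the "train" population
sample_2 <- rnorm(150, mean = 0.5)   # stand-in for a shifted "test" population

# Empirical distribution functions of the two samples
ecdf_1 <- ecdf(sample_1)
ecdf_2 <- ecdf(sample_2)

# The largest gap between two step functions occurs at an observed point,
# so it is enough to evaluate both ECDFs on the pooled observations
pooled   <- sort(c(sample_1, sample_2))
d_manual <- max(abs(ecdf_1(pooled) - ecdf_2(pooled)))

# The same statistic as reported by the built-in test
d_builtin <- unname(ks.test(sample_1, sample_2)$statistic)

all.equal(d_manual, d_builtin)
```

The hand computation and ks.test() should agree, which shows that the D value in the test output is simply the largest vertical gap between the two ECDF curves.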

The following code first defines a function ks_test() that plots the Empirical 
Cumulative Distribution Function (ECDF) and displays the KS test result. 


#Defining a function to give ks test result and ECDF plots on log scale 
library(rgr) 
ks_test <- function(xx1, xx2, xlab = "House Price", 
                    x1lab = deparse(substitute(xx1)), 
                    x2lab = deparse(substitute(xx2)), 
                    ylab = "Empirical Cumulative Distribution Function", 
                    log = TRUE, main = "Empirical EDF Plots - K-S Test", 
                    pch1 = 3, col1 = 2, pch2 = 4, col2 = 4, 
                    cex = 0.8, cexp = 0.9, ...) 
{ 
  #Drop missing values, sort each sample, and compute its plotting positions 
  temp.x <- remove.na(xx1) 
  x1 <- sort(temp.x$x[1:temp.x$n]) 
  nx1 <- temp.x$n 
  y1 <- ((1:nx1) - 0.5)/nx1 
  temp.x <- remove.na(xx2) 
  x2 <- sort(temp.x$x[1:temp.x$n]) 
  nx2 <- temp.x$n 
  y2 <- ((1:nx2) - 0.5)/nx2 
  xlim <- range(c(x1, x2)) 
  #Use a log-scaled x axis only when all values are positive 
  if (log) { 
    logx <- "x" 
    if (xlim[1] <= 0) 
      stop("\n Values cannot be .le. zero for a log plot\n") 
  } else logx <- "" 
  #Draw an empty frame, then add one set of points per sample 
  plot(x1, y1, log = logx, xlim = xlim, xlab = xlab, ylab = ylab, 
       main = main, type = "n", ...) 
  points(x1, y1, pch = pch1, col = col1, cex = cexp) 
  points(x2, y2, pch = pch2, col = col2, cex = cexp) 
  #Run and print the two-sample KS test 
  temp <- ks.test(x1, x2) 
  print(temp) 
} 


Here, we call the custom function, which performs the KS test on set_1 and set_2 
and displays the Empirical Cumulative Distribution Function (ECDF) plots: 

#Perform K-S test on set_1 and set_2 and also display Empirical Cumulative Distribution Plots 
ks_test(set_1,set_2) 


[Figure: ECDF plot titled "Empirical EDF Plots - K-S Test"; y-axis: Empirical Cumulative Distribution Function] 

Figure 7-3. ECDF plots for Set_1 and Set_2 

Here, we show the hypothesis test results for the KS test. This is the Kolmogorov-Smirnov 
test for the hypothesis that both samples were drawn from the same 
underlying distribution. 


Two-sample Kolmogorov-Smirnov test 


data: x1 and x2 
D = 0.050684, p-value = 0.5744 
alternative hypothesis: two-sided 


As you can see, the p-value is more than 0.05, so we fail to reject the null hypothesis. 
So we are good to go ahead and test model performance on the test data. Also, looking at the 
Empirical Cumulative Distribution Function (ECDF) plot, we can see that the ECDFs of the two 
samples look the same, which is consistent with both coming from the same population distribution. 

How do the results look when the population becomes unstable? Let’s manipulate 
our set_2 to show that scenario. 

Consider that set_2 got exposed to a new law, where the houses in set_2 were 
subjected to additional tax by a local body and hence the prices went up. The question we 
will have is, can the existing model still perform well on this new set? 


#Manipulate the set 2 
set_2_new <- set_2*exp(set_2/100000) 

# Now do the k-s test again 
ks_test(set_1,set_2_new) 


Now let's again plot the ECDF for set_1 and the manipulated set_2 and see how they look in 
comparison (see Figure 7-4). 

[Figure: ECDF plot titled "Empirical EDF Plots - K-S Test"; x-axis: House Price (log scale, 1e+05 to 1e+09); y-axis: Empirical Cumulative Distribution Function (0.0 to 1.0)] 

Figure 7-4. ECDF plots for Set_1 and Set_2 (manipulated) 

We again perform the KS test to check the hypothesis results. 


Two-sample Kolmogorov-Smirnov test 


data: x1 and x2 
D = 0.79957, p-value < 2.2e-16 
alternative hypothesis: two-sided 


The KS test's p-value is less than 0.05 and hence the test rejects the null hypothesis 
that both samples are from the same population. Visually, the two ECDF curves are far 
apart. Hence, the model can't be used on the new dataset, although the dataset has the 
same schema and comes from the same business feed. 

We can quickly show how to do population stability tests for the discrete case of 
purchase prediction for ProductChoice. The test is performed by calculating the 
Population Stability Index (PSI), defined as: 

$$PSI = \sum_{i} \left( \frac{n_{1i}}{N_1} - \frac{n_{2i}}{N_2} \right) \ln\left( \frac{n_{1i}/N_1}{n_{2i}/N_2} \right)$$ 

where n_{1i} and n_{2i} are the numbers of observations in bin i for populations 1 and 2, and 
N_1 and N_2 are the total numbers of observations for populations 1 and 2. 

As the Population Stability Index for the discrete case does not follow a known distribution, 
we rely on threshold values. As a rule, the thresholds below can be used to interpret the 
population stability index: 

• A PSI < 0.1 indicates a minimal change in the population. 
• A PSI of 0.1 to 0.2 indicates changes that require further investigation. 
• A PSI > 0.2 indicates a significant change in the population. 


This code snippet calculates the Population Stability Index using this formula. 


#Let's create set 1 and set 2 from our Purchase Prediction Data 
print("Distribution of ProductChoice values before partition") 
[1] "Distribution of ProductChoice values before partition" 

table(Data_Purchase$ProductChoice) 

     1      2      3      4 
104619 189351 142504  49470 

set_1 <- Data_Purchase[1:floor(nrow(Data_Purchase)*(2/3)), ]$ProductChoice 
table(set_1) 
set_1 
     1      2      3      4 
 69402 126391  95157  33012 

set_2 <- Data_Purchase[floor(nrow(Data_Purchase)*(2/3) +1):nrow(Data_Purchase), ]$ProductChoice 
table(set_2) 
set_2 
    1     2     3     4 
35217 62960 47347 16458 


Now we will treat set_1 as population 1 and set_2 as population 2 and calculate the 
PSI. A similar exercise can be repeated with different parameters to see if the population 
remains stable with respect to other discrete distributions. 


#PSI = Summation(((n1i/N1) - (n2i/N2)) * ln((n1i/N1)/(n2i/N2))) 
#Difference between the bin proportions of the two populations 
temp1 <- (table(set_1)/length(set_1) - table(set_2)/length(set_2)) 
#Log of the ratio of the bin proportions, as in the formula 
temp2 <- log((table(set_1)/length(set_1)) / (table(set_2)/length(set_2))) 
psi <- sum(temp1*temp2) 

if (psi < 0.1) { 
  cat("The population is stable with a PSI of ", psi) 
} else if (psi >= 0.1 && psi <= 0.2) { 
  cat("The population needs further investigation with a PSI of ", psi) 
} else { 
  cat("The population has gone through significant changes with a PSI of ", psi) 
} 

For the bin counts tabulated above, the PSI works out to roughly 0.00006, far below the 
0.1 threshold, so the code reports that the population is stable. 


As you must have observed from these examples, essentially we are comparing two 
distributions and making sure the distributions are similar. This test helps us ascertain 
how credible the model would be on the new data. 
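
The PSI calculation can also be wrapped in a small reusable function. The sketch below is one possible way to do it, assuming the bins are already tabulated as counts and that every bin is non-empty in both populations (the logarithm is undefined otherwise). The names psi_from_counts() and psi_label() are hypothetical; the first two count vectors are the ProductChoice counts for set_1 and set_2 shown earlier, and the third is a made-up example of a shifted population.

```r
# Hypothetical helper: PSI from two vectors of bin counts (same bins, same order)
psi_from_counts <- function(n1, n2) {
  stopifnot(length(n1) == length(n2), all(n1 > 0), all(n2 > 0))
  p1 <- n1 / sum(n1)   # bin proportions in population 1
  p2 <- n2 / sum(n2)   # bin proportions in population 2
  sum((p1 - p2) * log(p1 / p2))
}

# Hypothetical helper: map a PSI value to the rule-of-thumb interpretation
psi_label <- function(psi) {
  if (psi < 0.1) "stable"
  else if (psi <= 0.2) "needs further investigation"
  else "significant change"
}

# ProductChoice counts for set_1 and set_2, as tabulated in this section
counts_set_1 <- c(69402, 126391, 95157, 33012)
counts_set_2 <- c(35217, 62960, 47347, 16458)
psi_purchase <- psi_from_counts(counts_set_1, counts_set_2)
psi_label(psi_purchase)

# A made-up shifted population for contrast
psi_shifted <- psi_from_counts(c(25, 25, 25, 25), c(10, 20, 30, 40))
psi_label(psi_shifted)
```

Each term of the sum is non-negative, because the difference of two proportions and the log of their ratio always have the same sign, so the index is zero only when the two distributions match exactly.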


7.5 Model Evaluation for Continuous Output 


The distribution of dependent variables is an important consideration in choosing 
the methods for evaluating the models. Intuitively, we end up comparing the residual 
distribution (actual versus predicted value) with either a normal distribution (i.e., random 
noise) or some other distribution based on the metrics we choose. 

This section is dedicated to the cases where the residual error is on a continuous 
scale. Within the scope of this chapter, we will focus on the linear regression model and 
calculate some basic metrics. The metrics come with their own merits and demerits, and 
we will try to focus on some of them from a business interpretation perspective. 

Let's fit a linear regression model on the house price data, using the subset of variables 
chosen by forward selection. Then, with this model, we will show different model 
performance metrics. 


# Create a model on Set 1 = Train data 
linear_reg_model <- lm(HousePrice ~ StoreArea + StreetHouseFront + BasementArea + 
                         LawnArea + Rating + SaleType, 
                       data = Data_House_Price[1:floor(nrow(Data_House_Price)*(2/3)), ]) 

summary(linear_reg_model) 


Call: 
lm(formula = HousePrice ~ StoreArea + StreetHouseFront + BasementArea + 
    LawnArea + Rating + SaleType, data = Data_House_Price[1:floor(nrow(Data_House_Price) * 
    (2/3)), ]) 

Residuals: 
    Min      1Q  Median      3Q     Max 
-432276  -22901   -3239   17285  380300 

Coefficients: 
                       Estimate Std. Error t value Pr(>|t|) 
(Intercept)          -8.003e+04  3.262e+04  -2.454 0.014387 * 
StoreArea             5.817e+01  9.851e+00   5.905 5.48e-09 *** 
StreetHouseFront      1.370e+02  8.083e+01   1.695 0.090578 . 
BasementArea          2.362e+01  3.722e+00   6.346 3.96e-10 *** 
LawnArea              7.746e-01  1.987e-01   3.897 0.000107 *** 
Rating                3.540e+04  1.519e+03  23.300  < 2e-16 *** 
SaleTypeFirstResale   1.012e+04  3.250e+04   0.311 0.755651 
SaleTypeFourthResale -3.221e+04  3.678e+04  -0.876 0.381511 
SaleTypeNewHouse     -1.298e+04  3.190e+04  -0.407 0.684268 
SaleTypeSecondResale -2.456e+04  3.248e+04  -0.756 0.449750 
SaleTypeThirdResale  -2.256e+04  3.485e+04  -0.647 0.517536 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 44860 on 701 degrees of freedom 
Multiple R-squared: 0.7155, Adjusted R-squared: 0.7115 
F-statistic: 176.3 on 10 and 701 DF, p-value: < 2.2e-16 


The model summary shows a few things: 


• The Multiple R-squared of the fitted model is 0.7155 (about 71.5%), which 
indicates a reasonably good fit. 


• None of the SaleType levels is significant at the conventional levels (but we have 
kept the variable in the model as we believe that it's an important element of 
HousePrice). 


• The p-value for the F-test of overall significance is less 
than 0.05, so we can reject the null hypothesis and conclude that 
the model provides a better fit than the intercept-only model. 


Now we will move on to the performance measures for a continuous dependent 
variable. 

7.5.1 Mean Absolute Error 


Mean absolute error, or MAE, is one of the most basic error metrics used to evaluate a 
model. MAE is derived directly from the first norm (L1 norm) of the residual errors: it is the 
average of the absolute errors. 

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| f_i - y_i \right| = \frac{1}{n} \sum_{i=1}^{n} \left| e_i \right|$$ 

where f_i is the prediction, y_i is the true value, and e_i is the error for observation i. 

There are other similar measures like Mean Absolute Scaled Error (MASE) and 
Mean Absolute Percentage Error (MAPE). All these measures summarize 
performance in a way that treats underprediction and overprediction the same; 
the mean signed difference is ignored. This is a specific drawback, because the direction 
of the error often matters. In business problems we are usually fine with error 
in one direction but not the other. Take, for instance, calculating credit loss on credit cards. 
The business should be fine if it is overpredicting the loss and hence keeping a little more 
reserve. However, underpredicting the loss is highly costly and may trigger bankruptcy in extreme 
cases. 
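
The point about direction can be illustrated with a small made-up example. The sketch below compares MAE with the mean signed error and with a simple asymmetric cost in which underprediction is penalized more heavily. The numbers and the weight under_weight = 3 are purely illustrative assumptions, not values from the book's data.

```r
# Made-up actual and predicted losses
actual    <- c(100, 120, 90, 150, 110)
predicted <- c(110, 115, 100, 140, 120)

errors <- predicted - actual        # positive = overprediction, negative = underprediction

mae               <- mean(abs(errors))   # direction is ignored
mean_signed_error <- mean(errors)        # direction is kept, but errors can cancel out

# Illustrative asymmetric cost: an underprediction counts 3 times as much
under_weight <- 3
asymmetric_cost <- mean(ifelse(errors < 0, under_weight * abs(errors), abs(errors)))

c(MAE = mae, MeanSignedError = mean_signed_error, AsymmetricCost = asymmetric_cost)
```

Two models with the same MAE can have very different asymmetric costs, which is why a business-specific metric is sometimes reported alongside the standard one.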


#Create the test data which is set 2 
test <- Data_House_Price[floor(nrow(Data_House_Price)*(2/3) +1):nrow(Data_House_Price), ] 

#Apply the fitted linear regression model to the test data and get predicted values 
predicted_lm <- predict(linear_reg_model, test, type = "response") 

actual_predicted <- as.data.frame(cbind(as.numeric(test$HOUSE_ID), 
                                        as.numeric(test$HousePrice), 
                                        as.numeric(predicted_lm))) 

names(actual_predicted) <- c("HOUSE_ID", "Actual", "Predicted") 

library(ggplot2) 

#Plot Actual vs Predicted values for Test Cases 
ggplot(actual_predicted, aes(x = HOUSE_ID)) + 
  geom_line(aes(y = Actual, color = "Actual")) + 
  geom_line(aes(y = Predicted, color = "Predicted")) + 
  labs(color = "Series") + 
  xlab('HOUSE ID') + ylab('House Sale Price') 

It's clear from the plot in Figure 7-5 that the actual values are close to the predicted values. Now 
let's find out how our model is performing on the Mean Absolute Error metric. 

Figure 7-5. Actual versus predicted plot 


#Remove NA from test, as we have not done any treatment for NA 
actual_predicted <- na.omit(actual_predicted) 

#First take Actual - Predicted, then take mean of absolute errors(residual) 
mae <- sum(abs(actual_predicted$Actual - actual_predicted$Predicted))/nrow(actual_predicted) 

cat("Mean Absolute Error for the test case is ", mae) 
Mean Absolute Error for the test case is 29570.3 


The MAE says that, on average, the error is $29,570. On a dollar scale, this is equivalent 
to an expected error of roughly 16% relative to a mean of $180,921. 

This metric can also be used to fit a linear model. Just as the least squares method is related 
to mean squared error, mean absolute error is related to least absolute deviations. 

7.5.2 Root Mean Square Error 


Root mean square error or RMSE is one of the most popular metrics used to evaluate 
continuous error models. As the name suggests, it is the square root of mean of squared 
errors. The most important feature of this metric is that the errors are weighted by means 
of squaring them. 

For example, suppose the predicted value is 5.5 while the actual value is 4.1. Then 
the error is 1.4 (5.5 - 4.1). The square of this error is 1.4 x 1.4 = 1.96. Assume another 
scenario, where the predicted value is 6.5; then the error is 2.4 (6.5 - 4.1), and the square 
of the error is 2.4 x 2.4 = 5.76. As you can see, while the error only changed 2.4/1.4 = 1.7 times, 
the squared error changed 5.76/1.96 = 2.94 times. Hence, RMSE penalizes far-off errors 
more strictly than close-by errors. 
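
The arithmetic in this example can be reproduced in a few lines of R. This is only a sketch of the worked numbers above; the variable names are chosen here for illustration.

```r
actual    <- 4.1
predicted <- c(5.5, 6.5)            # the two scenarios from the text

errors         <- predicted - actual   # 1.4 and 2.4
squared_errors <- errors^2             # 1.96 and 5.76

error_ratio   <- errors[2] / errors[1]                   # growth of the plain error
squared_ratio <- squared_errors[2] / squared_errors[1]   # growth of the squared error

round(c(error_ratio = error_ratio, squared_ratio = squared_ratio), 2)
```

The squared ratio is exactly the square of the plain ratio, which is the mechanism by which squaring magnifies large errors relative to small ones.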

The RMSE of the predicted values \hat{y}_t for times t of a regression's dependent variable y_t is 
computed for n different predictions as the square root of the mean of the squares of the 
deviations: 

$$RMSE = \sqrt{\frac{\sum_{t=1}^{n} \left( \hat{y}_t - y_t \right)^2}{n}}$$ 


It is important to understand how the operations in the metric change the 
interpretation of the metric. Suppose our dependent variable is house price, which is 
captured in dollar numbers. Let’s see how the metric dimensions evolve to interpret the 
measure. 

The predicted and actual values are in dollars, so their difference, the error, is again in 
dollars. Then you square the error, so the dimension becomes dollars squared. You can't 
compare a dollars-squared value to a dollar value. So, we take the square root to bring the 
dimension back to dollars and can now interpret RMSE in dollar terms. It's important to note 
that the metrics used for model comparison are generally dimensionless, but for the model itself 
we prefer metrics that have a dimension, to provide a business context to the metric. 


#As we already have the actual and predicted values, we can directly calculate the RMSE value 
rmse <- sqrt(sum((actual_predicted$Actual - actual_predicted$Predicted)^2)/nrow(actual_predicted)) 

cat("Root Mean Square Error for the test case is ", rmse) 
Root Mean Square Error for the test case is 44459.42 


Now you can see that the error has scaled up to $44,459. This is because we are now 
penalizing the model for far-away predictions by squaring the errors. 

As mentioned earlier, if you want to use a metric to compare datasets or 
models with different scales, you need to bring the metric into a dimensionless form. We 
can do this with RMSE by normalizing it. The most common way is to divide the 
RMSE by the range or the mean of the observed values: 


$$NRMSD = \frac{RMSD}{y_{max} - y_{min}} \quad \text{or} \quad NRMSD = \frac{RMSD}{\bar{y}}$$ 

This value is referred to as the normalized root-mean-square deviation or error 
(NRMSD or NRMSE), and usually expressed as a percentage. A low value indicates less 
residual variance and hence is a good model. 
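
As a small sketch of these definitions, the following code computes RMSE and both normalized variants on made-up numbers. The function names rmse_value(), nrmse_range(), and nrmse_mean() are hypothetical helpers introduced here for illustration.

```r
# Root mean square error of a set of predictions
rmse_value <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# RMSE normalized by the range of the observed values
nrmse_range <- function(actual, predicted) {
  rmse_value(actual, predicted) / (max(actual) - min(actual))
}

# RMSE normalized by the mean of the observed values
nrmse_mean <- function(actual, predicted) {
  rmse_value(actual, predicted) / mean(actual)
}

# Made-up house prices (in thousands of dollars) and predictions
actual    <- c(200, 250, 300, 400)
predicted <- c(210, 240, 330, 380)

rmse_value(actual, predicted)            # in the same units as the prices
100 * nrmse_range(actual, predicted)     # as a percentage of the observed range
100 * nrmse_mean(actual, predicted)      # as a percentage of the observed mean
```

The two normalizations generally give different percentages, so it is worth stating which denominator was used whenever an NRMSE is reported.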


7.5.3 R-Square 


R-square is a popular measure used for linear regression based techniques. 

The appropriate terminology used by statisticians for R-square is the Coefficient of 
Determination. The Coefficient of Determination gives an indication of the relationship 
between the dependent variable (y) and a set of independent variables (x). In 
mathematical form, it is one minus the ratio of the residual sum of squares to the total sum of squares. 
Again, note that this measure also originates from the residuals (an error metric) using actual 
and predicted values. Here, we explain how the R² metric gets calculated for a model, and 
then how we interpret the metric. 


Note Capital R² and lowercase r² are loosely used interchangeably, but they are not the same. R² is 
the multiple R² in a multiple regression model. In bivariate linear regression, there is no 
multiple R, and R² = r². So the key difference is applicability of the term (or notation): 
"multiple R" implies multiple regressors, whereas "R²" doesn't. 


A dataset has n values marked y_1...y_n (collectively known as y_i or as a vector 
y = [y_1...y_n]), each associated with a predicted (or modeled) value f_1...f_n (known as f_i, or 
sometimes \hat{y}_i, as a vector f). 

The residual (error in prediction) is defined as e_i = y_i - f_i (forming a vector e). 

If \bar{y} is the mean of the observed data, 

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$ 

then the variability of the dataset can be measured using three sums of squares formulas: 


• The total sum of squares (proportional to the variance of the data): 

$$SS_{tot} = \sum_{i} \left( y_i - \bar{y} \right)^2$$ 


• The regression sum of squares, also called the explained sum of 
squares: 

$$SS_{reg} = \sum_{i} \left( f_i - \bar{y} \right)^2$$ 


• The sum of squares of residuals, also called the residual sum of 
squares: 

$$SS_{res} = \sum_{i} \left( y_i - f_i \right)^2 = \sum_{i} e_i^2$$ 


• The general definition of the coefficient of determination, or R², is 

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$ 


In Figure 7-6, we can see the interpretation of the sums of squares and how they 
come together to form the definition of the coefficient of determination. 

Figure 7-6. Image explaining squared errors (taken from 
https://en.wikipedia.org/wiki/Coefficient_of_determination) 

R² = 1 - (blue area / red area) 

The smaller squares represent the squared residuals with respect to the linear 
regression. The areas of the larger squares represent the squared residuals with respect to 
the average value. 

On the left, the data are compared with the simple average, while on 
the right they are compared with the fitted regression line. R² then compares the two, indicating how 
much more of the variation you capture by using this model rather than the simple average. 
Needless to say, a perfect value of 1 means all the variation is explained by the model. 
For a least-squares fit with an intercept, evaluated on the data it was trained on, R² is a 
proportion and is always a number between 0 and 1. 


• If R² = 1, all of the data points fall perfectly on the regression line 
(or the predictor x accounts for all the variation in y). 


• If R² = 0, the estimated regression line is perfectly horizontal (or 
the predictor x accounts for none of the variation in y). 


• If R² is between 0 and 1, the model explains part of the variance in y (using the model 
is better than not using the model). 


Though R-square is the default output of all the standard linear regression packages, 
we will show you the calculations as well. Another term that you need to be aware of is 
adjusted R-squared. It corrects for the number of predictors in the model. 
In other words, it takes into account the overfitting of the model due to a high number of 
predictors, and it increases only if a new term improves the model more than would be 
expected by chance. 
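
The effect of the adjustment can be seen in a small experiment. The sketch below uses the built-in mtcars data rather than the house price data, and adds a predictor of pure random noise; the helper adjusted_r_squared() is a hypothetical name, and its formula assumes a model with an intercept and no dropped (aliased) coefficients.

```r
set.seed(123)
cars_noise <- mtcars
cars_noise$noise <- rnorm(nrow(cars_noise))   # unrelated to mpg by construction

fit_small <- lm(mpg ~ wt + hp, data = cars_noise)
fit_large <- lm(mpg ~ wt + hp + noise, data = cars_noise)

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors
adjusted_r_squared <- function(fit) {
  r_sq <- summary(fit)$r.squared
  n <- nobs(fit)
  p <- length(coef(fit)) - 1        # exclude the intercept
  1 - (1 - r_sq) * (n - 1) / (n - p - 1)
}

# Plain R-squared can only go up when a predictor is added
c(small = summary(fit_small)$r.squared, large = summary(fit_large)$r.squared)

# Adjusted R-squared goes up only if the gain outweighs the extra parameter
c(small = adjusted_r_squared(fit_small), large = adjusted_r_squared(fit_large))
```

The hand-computed value should match the adj.r.squared element returned by summary(), which makes the helper a convenient check on the formula rather than a replacement for the built-in output.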


#Model training data (we will show our analysis on this dataset) 
train <- Data_House_Price[1:floor(nrow(Data_House_Price)*(2/3)), 
                          .(HousePrice, StoreArea, StreetHouseFront, BasementArea, LawnArea, 
                            StreetHouseFront, LawnArea, Rating, SaleType)]; 

#Omitting the NA from dataset 
train <- na.omit(train) 

# Get a linear regression model 
linear_reg_model <- lm(HousePrice ~ StoreArea + StreetHouseFront + BasementArea + 
                         LawnArea + StreetHouseFront + LawnArea + Rating + SaleType, data = train) 

# Show the function call to identify what model we will be working on 
print(linear_reg_model$call) 
lm(formula = HousePrice ~ StoreArea + StreetHouseFront + BasementArea + 
    LawnArea + StreetHouseFront + LawnArea + Rating + SaleType, 
    data = train) 

#System generated R square value 
cat("The system generated R square value is ", summary(linear_reg_model)$r.squared) 
The system generated R square value is 0.7155461 


You can see that the default model output calculated R-square for us. The current 
linear model has an R-square of 0.72. It can be interpreted as follows: 72% of the variation 
in house price is "explained by" the variation in the predictors StoreArea, StreetHouseFront, 
BasementArea, LawnArea, Rating, and SaleType. 

Here, we calculate the measure step by step to get the same R-square value. 

#Calculate Total Sum of Squares 
SST <- sum((train$HousePrice - mean(train$HousePrice))^2); 
#Calculate Regression Sum of Squares 
SSR <- sum((linear_reg_model$fitted.values - mean(train$HousePrice))^2); 
#Calculate residual(Error) Sum of Squares 
SSE <- sum((train$HousePrice - linear_reg_model$fitted.values)^2); 

One of the important relationships that these three sum of squares share is 

SST = SSR + SSE 


You can test that on your own. Now we will use these values and get the R-square for 
our model: 


#Calculate R-squared 
R_Sqr <- 1-(SSE/SST) 

#Display the calculated R-Sqr 
cat("The calculated R Square is ", R_Sqr) 
The calculated R Square is 0.7155461 


You can see that the calculated R-square is the same as the lm() function output, 
which confirms the calculations behind R-square. 
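
The relationship between the three sums of squares can be checked numerically. The following sketch does so on the built-in mtcars data (an assumption made here so that the example is self-contained; the same steps apply to the house price model). The decomposition holds for a least-squares fit that includes an intercept.

```r
# A small least-squares model with an intercept
fit <- lm(mpg ~ wt + hp, data = mtcars)
y   <- mtcars$mpg

sst <- sum((y - mean(y))^2)             # total sum of squares
ssr <- sum((fitted(fit) - mean(y))^2)   # regression (explained) sum of squares
sse <- sum(residuals(fit)^2)            # residual (error) sum of squares

# The decomposition SST = SSR + SSE
all.equal(sst, ssr + sse)

# Two equivalent ways of getting R-squared from the sums of squares
r_sq_from_sse <- 1 - sse / sst
r_sq_from_ssr <- ssr / sst
all.equal(r_sq_from_sse, summary(fit)$r.squared)
```

Because SST = SSR + SSE, the ratio SSR/SST and the expression 1 - SSE/SST give the same number, which is why R-square is often described as the proportion of variance explained.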

In this section, you saw some of the basic metrics that we can create around the 
errors (residuals) and interpreted them as a measure of how well our model will do on the 
actual data. In the next section, we will introduce techniques for discrete cases. 


7.6 Model Evaluation for Discrete Output 


In the previous section, we introduced metrics for models where the dependent variable and 
predicted values were of continuous types. In this section, we will introduce some metrics 
for cases where the distribution is discrete. 

For this section, we will go back to our purchase prediction data, generate 
the metrics, and discuss their interpretation. We will leverage the setup we created for 
population stability. 

7.6.1 Classification Matrix 


A classification matrix is one of the most intuitive ways of looking at the performance of a 
classifier. It is sometimes also called a confusion matrix. Visually, this is a two-way 
matrix with one axis showing the distribution of the actual class and the other axis 
showing the predicted class (see Figure 7-7). 


True Positive (TP) | False Negative(FN) 
False Positive(FP) | True Negative(TN) 


Figure 7-7. Two class classification matrix 





The accuracy of the model is calculated by the diagonal elements of the classification 
matrix, as they represent the correct classification by the classifier, i.e., the actual and 
predicted values are the same. 


Classification Rate = (True Positive + True Negative) / Total Cases 
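
As a minimal illustration of this formula, the following sketch builds a two-class classification matrix with table() and computes the classification rate from its diagonal. The actual and predicted labels are made up for the example.

```r
# Made-up actual and predicted classes for eight cases
actual    <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes", "no", "yes", "no", "no", "yes", "no", "yes"),
                    levels = c("yes", "no"))

# Rows are actual classes, columns are predicted classes
conf_matrix <- table(Actual = actual, Predicted = predicted)
conf_matrix

# Correct predictions sit on the diagonal
classification_rate <- sum(diag(conf_matrix)) / sum(conf_matrix)
classification_rate
```

Fixing the factor levels explicitly keeps the rows and columns in the same order, so the diagonal always holds the correctly classified cases even if one class is never predicted.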


Now we will show you the classification matrix and calculate the classification rate 
for our purchase prediction data. The method we will use for modeling probabilities is a 
multinomial logistic and the classifier will pick the highest probability. 


Note To avoid the class imbalance problem, we will be using stratified sampling to create 
equal-sized classes for illustrating the model performance concepts. A class imbalance 
problem biases the probabilities toward the high-frequency classes, and hence the 
classifier rarely assigns cases to the low-frequency classes. 
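
The idea of drawing the same number of rows from each class can be sketched in base R as follows. This is only an illustration on a made-up data frame (toy) and is not the splitstackshape approach used for the purchase data; the per-class size of 10 is an arbitrary choice that must not exceed the size of the smallest class.

```r
set.seed(7)

# Made-up imbalanced data: 70 rows of class "a", 20 of "b", 10 of "c"
toy <- data.frame(id = 1:100,
                  class = rep(c("a", "b", "c"), times = c(70, 20, 10)))
table(toy$class)

# Draw the same number of rows, without replacement, from every class
per_class <- 10
balanced <- do.call(rbind, lapply(split(toy, toy$class), function(group) {
  group[sample(nrow(group), per_class), ]
}))

table(balanced$class)
```

Downsampling like this discards rows from the larger classes, which is acceptable for illustration but may not be the best choice when data is scarce.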


#Remove the data having NA. NA is ignored in modeling algorithms 
Data_Purchase <- na.omit(Data_Purchase) 

#Sample equal sizes from Data_Purchase to reduce class imbalance issue 
library(splitstackshape) 
Data_Purchase_Model <- stratified(Data_Purchase, group = c("ProductChoice"), size = 10000, replace = FALSE) 

print("The Distribution of equal classes is as below") 
[1] "The Distribution of equal classes is as below" 

table(Data_Purchase_Model$ProductChoice) 

    1     2     3     4 
10000 10000 10000 10000 


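Build the multinomial model on the train data; the test data will then 
be used for performance testing. Because the training rows are sampled with replacement 
(replace = TRUE), some customers are drawn more than once, so the customers that were never 
drawn (the test set) make up more than 30% of the data. 


set.seed(917); 
train <- Data_Purchase_Model[sample(nrow(Data_Purchase_Model), size = nrow(Data_Purchase_Model)*(0.7), replace = TRUE, prob = NULL), ] 
dim(train) 
[1] 28000 6 
test <- Data_Purchase_Model[!(Data_Purchase_Model$CUSTOMER_ID %in% train$CUSTOMER_ID), ] 
dim(test) 
[1] 20002 6 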


Fit a multinomial logistic model 


library(nnet) 
mnl_model <- multinom(ProductChoice ~ MembershipPoints + IncomeClass + CustomerPropensity + LastPurchaseDuration, data = train) 
# weights:  68 (48 variable) 
initial  value 38816.242111 
iter  10 value 37672.163254 
iter  20 value 37574.198380 
iter  30 value 37413.360061 
iter  40 value 37327.695046 
iter  50 value 37263.280870 
iter  60 value 37261.603993 
final  value 37261.599306 
converged 


Display the summary of model statistics 


mnl_model 
Call: 
multinom(formula = ProductChoice ~ MembershipPoints + IncomeClass + 
    CustomerPropensity + LastPurchaseDuration, data = train) 

Coefficients: 
  (Intercept) MembershipPoints IncomeClass1 IncomeClass2 IncomeClass3 
2   11.682714      -0.03332131  -11.4405637   -11.314417   -11.307691 
3   -1.967090       0.02730530    0.9855891     1.644233     2.224430 
4   -1.618001      -0.12008110    1.5710959     1.692566     2.062924 
  IncomeClass4 IncomeClass5 IncomeClass6 IncomeClass7 IncomeClass8 
2   -11.547647   -11.465621   -11.447368   -11.388917   -11.367926 
3     2.023594     2.119750     2.201136     2.169300     2.241395 
4     1.911509     2.062195     2.296741     2.249285     2.509872 
  IncomeClass9 CustomerPropensityLow CustomerPropensityMedium 
2   -12.047828            -0.4106025               -0.2580652 
3     1.997350            -0.8727976               -0.5184574 
4     2.027252            -0.6549446               -0.5105506 
  CustomerPropensityUnknown CustomerPropensityVeryHigh 
2                -0.5689626                  0.1774420 
3                -1.1769285                  0.4646328 
4                -1.1494067                  0.5660523 
  LastPurchaseDuration 
2           0.04809274 
3           0.05624992 
4           0.08436483 


Residual Deviance: 74523.2 
AIC: 74619.2 


Predict the probabilities 


predicted_test <- as.data.frame(predict(mnl_model, newdata = test, type = "probs")) 


Display the predicted probabilities 


head(predicted_test) 
          1         2         3          4 
1 0.3423453 0.2468372 0.2252361 0.18558132 
2 0.2599605 0.2755778 0.2546863 0.20977542 
3 0.4096704 0.2429370 0.2482094 0.09918326 
4 0.2220821 0.2485851 0.3188838 0.21044894 
5 0.4163053 0.2689046 0.1763766 0.13841355 
6 0.4284514 0.2626000 0.1948703 0.11407836 


Do the prediction based on the highest probability 

test_result <- apply(predicted_test, 1, which.max) 
table(test_result) 
test_result 
   1    2    3    4 
8928 1265 3879 5930 

Combine to get predicted and actuals at one place 

result <- as.data.frame(cbind(test$ProductChoice, test_result)) 
colnames(result) <- c("Actual Class", "Predicted Class") 
head(result) 
  Actual Class Predicted Class 
1            1               1 
2            1               2 
3            1               1 
4            1               3 
5            1               1 
6            1               1 


Now that we have the actual and predicted classes side by side, we will create the 
classification matrix and calculate some of its key features: 


• Number of cases: Total number of cases, or number of rows, in test (n) 

• Number of classes: Total number of classes for which prediction is done (nclass) 

• Number of correct classifications: The diagonal of the classification matrix, 
which gives one count per class (diag) 

• Number of instances per class: The number of cases in each actual class (rowsums) 

• Number of instances per predicted class: The number of cases in each predicted 
class (colsums) 

• Distribution of actuals: rowsums divided by the total number of cases 

• Distribution of predicted: colsums divided by the total number of cases 


Create the classification matrix 


cmat <- as.matrix(table(Actual = result$`Actual Class`, Predicted = result$`Predicted Class`)) 


Calculate the above-mentioned measures in order 


n <- sum(cmat); 
cat("Number of Cases ", n); 
Number of Cases 20002 
nclass <- nrow(cmat); 
cat("Number of classes ", nclass); 
Number of classes 4 
correct_class <- diag(cmat); 
cat("Number of Correct Classification ", correct_class); 
Number of Correct Classification 3175 395 1320 2020 
rowsums <- apply(cmat, 1, sum); 
cat("Number of Instances per class ", rowsums); 
Number of Instances per class 4998 4995 5035 4974 
colsums <- apply(cmat, 2, sum); 
cat("Number of Instances per predicted class ", colsums); 
Number of Instances per predicted class 8928 1265 3879 5930 
actual_dist <- rowsums / n; 
cat("Distribution of actuals ", actual_dist); 
Distribution of actuals 0.249875 0.249725 0.2517248 0.2486751 
predict_dist <- colsums / n; 
cat("Distribution of predicted ", predict_dist); 
Distribution of predicted 0.4463554 0.06324368 0.1939306 0.2964704 


These quantities are calculated from the classification matrix. You are encouraged 
to verify these numbers and get a good understanding of these quantities. Here is the 
classification matrix and classification rate for our classifier: 


Print the classification matrix - on test data 


print(cmat) 
      Predicted 
Actual    1    2    3    4 
     1 3175  312  609  902 
     2 2407  395  825 1368 
     3 1791  284 1320 1640 
     4 1555  274 1125 2020 


Print Classification Rate 


classification_rate <- sum(correct_class) / n; 
print(classification_rate) 
[1] 0.3454655 
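
If you do not have the purchase data at hand, the printed classification matrix is enough to verify these quantities. The sketch below re-enters the matrix by hand (the values are copied from the output above) and recomputes the same measures with base R.

```r
# Classification matrix copied from the printed output (rows = actual, columns = predicted)
cmat <- matrix(c(3175,  312,  609,  902,
                 2407,  395,  825, 1368,
                 1791,  284, 1320, 1640,
                 1555,  274, 1125, 2020),
               nrow = 4, byrow = TRUE,
               dimnames = list(Actual = 1:4, Predicted = 1:4))

n             <- sum(cmat)        # number of cases
correct_class <- diag(cmat)       # correct classifications per class
rowsums       <- rowSums(cmat)    # instances per actual class
colsums       <- colSums(cmat)    # instances per predicted class

# Classification rate = sum of the diagonal / total cases
classification_rate <- sum(correct_class) / n
print(classification_rate)
```

The recomputed row sums, column sums, and classification rate agree with the values printed in the session above.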


The classification rate is low for this classifier. A classification rate of 35% means that 
the model is classifying the cases incorrectly about 65% of the time; with four equally 
sized classes, assigning classes at random would already give about 25%. The modeler 
has to dig into the reasons for the low performance of the classifier. The reasons can 
be the predicted probabilities, the explanatory power of the underlying variables, the 
sampling of imbalanced classes, or maybe the method of picking the highest probability itself. 

Model performance measures help us answer questions such as: Is the model actually 
performing up to our standards? Can we really use this in an actual environment? What 
might be causing the low performance? This step is important for any machine 
learning exercise. 


7.6.2 Sensitivity and Specificity 


Sensitivity and specificity measure the model performance on the positive and 
negative classes separately. The mathematical notation helps clarify these measures 
in conjunction with the classification matrix: 


• Sensitivity: The probability that the test will indicate the True class 
as True among the actual True cases. Also called the True Positive Rate (TPR) 
and, in pattern recognition, recall. Sensitivity can be 
calculated from the classification matrix (see Figure 7-7). 


Sensitivity, True Positive Rate = Correctly Identified Positives / 
Total Positives = TP/(TP+FN) 


• Specificity: The probability that the test will indicate the False 
class as False among the actual False cases. Also called the True 
Negative Rate (TNR). Specificity can be calculated from the 
classification matrix (see Figure 7-7). 

Specificity, True Negative Rate = Correctly Rejected Negatives / Total 
Negatives = TN/(TN+FP) 


Neither measure is the same as precision, which is TP/(TP+FP). 

Sensitivity and specificity are characteristics of the test: each is computed within one 
actual class, so the share of positives and negatives in the underlying population 
does not affect the results. For a good model, we try to maximize both TPR and TNR, and 
the Receiver Operating Characteristic (ROC) helps in this process. The ROC curve 
is a plot of sensitivity against (1 - specificity) as the cutoff changes, and the points 
nearest the top-left corner of the plot correspond to cutoffs that give high sensitivity 
together with high specificity. We will discuss the ROC curve 
in the next section and connect it back to optimizing sensitivity and specificity. 


Note Sensitivity and specificity are calculated per class. For a multinomial model, we 
tend to average the quantity over the classes to get a single number for the whole 
model. For illustration purposes, we will show the analysis by combining the classes into a 
two-class problem. You are encouraged to extend the concept to the full model. 


The analysis is shown for ProductChoice == 1 


Actual_Class <- ifelse(result$`Actual Class` == 1, "One", "Rest"); 
Predicted_Class <- ifelse(result$`Predicted Class` == 1, "One", "Rest"); 
ss_analysis <- as.data.frame(cbind(Actual_Class, Predicted_Class)); 


Create classification matrix for ProductChoice == 1 


cmat_ProductChoice1 <- as.matrix(table(Actual = ss_analysis$Actual_Class, Predicted = ss_analysis$Predicted_Class)); 
print(cmat_ProductChoice1) 
      Predicted 
Actual  One Rest 
  One  3175 1823 
  Rest 5753 9251 
classification_rate_ProductChoice1 <- sum(diag(cmat_ProductChoice1)) / n; 
cat("Classification rate for ProductChoice 1 is ", classification_rate_ProductChoice1) 
Classification rate for ProductChoice 1 is 0.6212379 


Calculate TPR and TNR 


TPR <- cmat_ProductChoice1[1,1] / (cmat_ProductChoice1[1,1] + cmat_ProductChoice1[1,2]); 
cat(" Sensitivity or True Positive Rate is ", TPR); 
 Sensitivity or True Positive Rate is 0.6352541 
TNR <- cmat_ProductChoice1[2,2] / (cmat_ProductChoice1[2,1] + cmat_ProductChoice1[2,2]) 
cat(" Specificity or True Negative Rate is ", TNR); 
 Specificity or True Negative Rate is 0.6165689 


The result shows that for ProductChoice == 1 our model is able to correctly classify 
62% of all cases. It identifies 64% of the actual "one" cases as "one" (sensitivity) 
and 62% of the actual "rest" cases as "rest" (specificity). The model is therefore 
slightly better at picking out "one" than at picking out "rest". 
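
To extend the one-versus-rest idea to every class, the per-class sensitivity and specificity can be read off the four-class classification matrix directly. The following is a minimal sketch that re-enters the printed test matrix by hand; the macro averages at the end are one simple way, among several, of summarizing the whole model with a single number.

```r
# Four-class test matrix copied from the printed output (rows = actual, columns = predicted)
cmat <- matrix(c(3175,  312,  609,  902,
                 2407,  395,  825, 1368,
                 1791,  284, 1320, 1640,
                 1555,  274, 1125, 2020),
               nrow = 4, byrow = TRUE,
               dimnames = list(Actual = 1:4, Predicted = 1:4))
n <- sum(cmat)

# Treat each class in turn as "positive" and the other three as "negative"
tp <- diag(cmat)             # predicted k and actually k
fn <- rowSums(cmat) - tp     # actually k, predicted something else
fp <- colSums(cmat) - tp     # predicted k, actually something else
tn <- n - tp - fn - fp       # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
print(round(rbind(sensitivity, specificity), 4))

# Unweighted (macro) averages over the four classes
macro <- c(sensitivity = mean(sensitivity), specificity = mean(specificity))
print(round(macro, 4))
```

The first column reproduces the TPR and TNR computed above for ProductChoice == 1; the remaining columns show how unevenly the model treats the other classes.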


7.6.3 Area Under ROC Curve 


A receiver operating characteristic (ROC) curve is a graphical representation 
of the performance of a binary classifier as the threshold or cutoff used to classify changes. 
As you saw in the previous section, for a good model we want to maximize two 
interdependent measures, TPR and TNR, and the ROC curve shows that relationship. The 
curve is created by plotting the true positive rate (TPR) against the false positive rate 
(FPR = 1 - TNR) at various cutoffs or threshold settings. 

However, as we are using a multiclass classifier, we are not using a cutoff to classify. 
You are encouraged to rebuild the multi-class model as a binary model (one and rest) for 
the other ProductChoices/classes, and then use the built-in functions of the ROCR package. 

Here we show the ROC curve and the Area Under the Curve (AUC) value, assuming the 
model only had two classes: ProductChoice "One" and "Rest". This gives us a scale of 
cutoffs, as if we were to use only the probability for class "One"/1. Observe that in the 
following code we re-create the model to change the multi-class problem into a binary 
classification problem. Essentially, the multinomial model distributes the probabilities 
among the classes in such a way that they sum to 1, while for ROC we need the full range of 
probabilities for a single class so that we can vary the threshold/cutoff value of classification. 

For illustration purposes, we will use our purchase prediction data with only two 
classes of choices, 0 or 1, defined here: 


• 1 if the customer chooses product 1 from a catalog of four 
products; this forms our positives 


• 0 if the customer chooses any product other than 1; this forms our 
negatives 


Here we create a binary logistic model with this definition. 


# create the variable ProductChoice_binom as per the above definition 
train$ProductChoice_binom <- ifelse(train$ProductChoice == 1, 1, 0); 
test$ProductChoice_binom <- ifelse(test$ProductChoice == 1, 1, 0); 


Fit a binary logistic model on the modified dependent variable, 
ProductChoice_binom. 


glm_ProductChoice_binom <- glm(ProductChoice_binom ~ MembershipPoints + IncomeClass + CustomerPropensity + LastPurchaseDuration, data = train, family = binomial(link = "logit")) 


Print the summary of binomial logistic model 

summary(glm_ProductChoice_binom) 

Call: 
glm(formula = ProductChoice_binom ~ MembershipPoints + IncomeClass + 
    CustomerPropensity + LastPurchaseDuration, family = binomial(link = "logit"), 
    data = train) 

Deviance Residuals: 
    Min       1Q   Median       3Q      Max 
-1.2213  -0.8317  -0.6088   1.2159   2.3976 

Coefficients: 
                             Estimate Std. Error z value Pr(>|z|) 
(Intercept)                -13.360676  71.621773  -0.187    0.852 
MembershipPoints             0.038574   0.005830   6.616 3.68e-11 *** 
IncomeClass1                12.379912  71.622606   0.173    0.863 
IncomeClass2                12.142239  71.622424   0.170    0.865 
IncomeClass3                11.881615  71.621801   0.166    0.868 
IncomeClass4                12.086976  71.621763   0.169    0.866 
IncomeClass5                11.981304  71.621759   0.167    0.867 
IncomeClass6                11.874714  71.621761   0.166    0.868 
IncomeClass7                11.879708  71.621765   0.166    0.868 
IncomeClass8                11.759389  71.621792   0.164    0.870 
IncomeClass9                12.214044  71.622000   0.171    0.865 
CustomerPropensityLow        0.650186   0.054060  12.027  < 2e-16 *** 
CustomerPropensityMedium     0.435307   0.054828   7.939 2.03e-15 *** 
CustomerPropensityUnknown    0.952099   0.048078  19.803  < 2e-16 *** 
CustomerPropensityVeryHigh  -0.430576   0.065156  -6.608 3.89e-11 *** 
LastPurchaseDuration        -0.062538   0.003409 -18.347  < 2e-16 *** 
--- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1) 

    Null deviance: 31611  on 27999  degrees of freedom 
Residual deviance: 29759  on 27984  degrees of freedom 
AIC: 29791 

Number of Fisher Scoring iterations: 11 


We will be using the ROCR library in R to calculate the Area Under the Curve (AUC) and 
to create the Receiver Operating Characteristic (ROC) curve. The ROCR package helps to 
visualize the performance of scoring classifiers. 


Now create the performance dataset to create the ROC curve 


library(ROCR) 
test_binom <- predict(glm_ProductChoice_binom, newdata = test, type = "response") 
pred <- prediction(test_binom, test$ProductChoice_binom) 
perf <- performance(pred, "tpr", "fpr") 


Calculating AUC 


auc <- unlist(slot(performance(pred, "auc"), "y.values")); 
cat("The Area Under ROC curve for this model is ", auc); 
The Area Under ROC curve for this model is 0.6699122 


Plotting the ROC curve 


library(ggplot2) 
library(plotROC) 
debug <- as.data.frame(cbind(test_binom, test$ProductChoice_binom)) 
ggplot(debug, aes(d = V2, m = test_binom)) + geom_roc() 


Figure 7-8. ROC curve 


We used a ggplot() object and the plotROC library to plot the ROC curve, with cutoff 
values highlighted in the plot for easy reading (see Figure 7-8). 

In the plot, we want to maximize the true positive rate while keeping the false positive 
rate low. The cutoff that gives the best balance between the two is the 
threshold value that you should use to create the classifier. Here, you can see that 
this value is close to 0.2, where the true positive rate is ~74% and the false positive 
rate is ~48%. 

Chapter 6 discussed the use of this optimal value, i.e., 0.2, as a cutoff for a binary 
classifier. Refer to that chapter's logistic regression discussion. The ROCR R package 
details are available at https://cran.r-project.org/web/packages/ROCR/ROCR.pdf. 
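
To see what the ROCR functions are doing, the ROC points and the AUC can also be computed by hand. The sketch below uses simulated scores rather than the purchase data; the variable names, the simulation settings, and the use of Youden's J (TPR minus FPR) to pick a cutoff are all assumptions made for illustration, not the method used in this chapter.

```r
# Simulated scores: positives tend to score higher (illustrative data only)
set.seed(917)
labels <- rbinom(500, 1, 0.3)
scores <- plogis(rnorm(500, mean = ifelse(labels == 1, 0.5, -0.5)))

# FPR and TPR when every case with score >= cutoff is called positive
roc_point <- function(cutoff, scores, labels) {
  pred <- as.integer(scores >= cutoff)
  c(fpr = sum(pred == 1 & labels == 0) / sum(labels == 0),
    tpr = sum(pred == 1 & labels == 1) / sum(labels == 1))
}

# Sweep the cutoff from "nobody is positive" down to "everybody is positive"
cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))
roc <- t(sapply(cutoffs, roc_point, scores = scores, labels = labels))

# AUC as the area under the (fpr, tpr) points, using the trapezoid rule
auc_trapezoid <- sum(diff(roc[, "fpr"]) *
                     (head(roc[, "tpr"], -1) + tail(roc[, "tpr"], -1)) / 2)

# The same AUC from ranks: the chance that a random positive outscores a random negative
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc_rank <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# One possible "best balance" cutoff: the largest gap between TPR and FPR (Youden's J)
best_cutoff <- cutoffs[which.max(roc[, "tpr"] - roc[, "fpr"])]
cat("AUC (trapezoid):", auc_trapezoid, " AUC (rank):", auc_rank, "\n")
```

The two AUC values agree, which is a useful sanity check. Other rules for choosing the cutoff, such as weighting false positives and false negatives by their business cost, are equally valid.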


7.7 Probabilistic Techniques 


Generally, model performance techniques are not formally classified 
into probabilistic and otherwise. However, it is helpful for you to understand how 
more complicated methods are emerging for model performance testing. Probabilistic 
techniques are those that are based on sampling and simulations. These techniques 
differ from what we discussed in previous sections, where we had residuals from a 
single fitted model to create metrics. In probabilistic techniques, we simulate and sample 
subsets of the data to get a robust and stable model. 

In this section, we will touch at a very high level on two techniques, one from each 
of the two major buckets of probabilistic tools that data scientists have at their disposal, 
although both are resampling-based techniques: 


• Simulation based: K-fold cross validation 
• Sampling based: Bootstrap sampling 


A very good treatment of these concepts is provided by Ron Kohavi of Stanford 
in a much celebrated paper, "A Study of Cross-Validation and Bootstrap for Accuracy 
Estimation and Model Selection," International Joint Conference on Artificial Intelligence 
(IJCAI), 1995. Readers interested in this topic should read this paper. In this section, 
we will touch on these ideas from the perspective of using them in R. 


7.7.1 K-Fold Cross Validation 


Cross validation is one of the most used techniques for model evaluation and lately 
has been accepted as a better technique than residual-based metrics. The issue with 
residual-based methods is that you need to keep a test set, and a single test set 
does not tell you exactly how the model will behave on unseen data. So while train, test, and 
validate methods are good, probabilistic simulation and sampling provide us more ways 
to test that. 

K-fold cross validation is very popular in the machine learning community. With more 
folds, the error estimate is averaged over more train/test splits (recall the Law of Large 
Numbers), at the cost of fitting more models. Steps to execute k-fold cross validation include: 


Step 1: Divide the dataset into k subsets. 

Step 2: Train a model on k-1 subsets. 

Step 3: Test the model on the remaining subset and calculate the error. 

Step 4: Repeat Steps 2-3 until every subset has been used exactly once for testing. 

Step 5: Average the errors from this scenario simulation exercise to get the 
cross-validation error. 
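
The five steps above can be sketched in a few lines of base R. This is a minimal illustration, not the caret implementation used later in this section: it uses the built-in mtcars data and a plain linear model as stand-ins, and the choice of k = 5 is arbitrary.

```r
set.seed(917)
k    <- 5
data <- mtcars   # built-in dataset, used here only as a stand-in

# Step 1: assign every row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(data)))

# Steps 2-4: for each fold, train on the other k-1 folds and test on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = data[folds != i, ])
  pred <- predict(fit, newdata = data[folds == i, ])
  sqrt(mean((data$mpg[folds == i] - pred)^2))
})

# Step 5: average the per-fold errors to get the cross-validation error
cv_rmse <- mean(fold_rmse)
print(round(fold_rmse, 3))
print(round(cv_rmse, 3))
```

Every row lands in exactly one test fold, so each observation is predicted once by a model that never saw it during fitting.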


The advantage of this method is that the way in which you create the k subsets 
matters much less than the way you make the single split in the train/test (or holdout 
cross validation) method. Also, this method ensures that every data point gets to be in a test 
set exactly once, and gets to be in a training set k-1 times. The variance of the resulting 
estimate is reduced as k is increased. 

The disadvantage of this method is that the model has to be estimated k times and 
tested k times, which means a higher computation cost (the computation cost 
is proportional to the number of folds). A variant randomly splits the data into training 
and test sets several times and controls the size of each split. The advantage of doing 
this is that you can independently choose how large 
each test set is and how many trials you average over. 


Note In cross-validation techniques we don't keep fixed train and test subsets; each fold 
takes its turn as the test set. Usually, data scientists keep a validation set outside 
the cross-validation to test the final model fit. 
In our example, we will treat our train as the training set and test as the validation dataset. 

Let's show an example with our house sales price problem. You are encouraged to 
apply the same techniques on the classification problems as well. 


library(caret) 
library(randomForest) 
set.seed(917); 


Model training data (we will show our analysis on this dataset) 


train <- Data_House_Price[1:floor(nrow(Data_House_Price)*(2/3)), .(HousePrice, StoreArea, StreetHouseFront, BasementArea, LawnArea, Rating, SaleType)]; 


Create the test data which is set 2 

test <- Data_House_Price[floor(nrow(Data_House_Price)*(2/3) + 1):nrow(Data_House_Price), .(HousePrice, StoreArea, StreetHouseFront, BasementArea, LawnArea, Rating, SaleType)] 

Omitting the NA from dataset 


train <- na.omit(train) 
test <- na.omit(test) 


Create the k subsets, let's take k as 10 (i.e., 10-fold cross validation) 


k_10_fold <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE) 


Fit the model on folds and use RMSE as the metric to select the model 

model_fitted <- train(HousePrice ~ StoreArea + StreetHouseFront + BasementArea + LawnArea + Rating + SaleType, data = train, family = identity, trControl = k_10_fold, tuneLength = 5) 


Display the summary of the cross validation 


model_fitted 
Random Forest 

712 samples 
  6 predictor 

No pre-processing 
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 642, 640, 640, 641, 640, 641, ... 
Resampling results across tuning parameters: 

  mtry  RMSE      Rsquared 
   2    40235.04  0.7891003 
   4    37938.62  0.7961153 
   6    38049.31  0.7927441 
   8    38132.67  0.7914360 
  10    38697.45  0.7858166 

RMSE was used to select the optimal model using the smallest value. 
The final value used for the model was mtry = 4. 


You can see from the summary that the model selected by cross-validation has a 
higher R² than the one we created previously. The new R-square is about 80% and the old was 
72%. Also, notice that the default metric to choose the best model is RMSE. You can 
change the metric and function type based on the need and the optimization function. 


7.7.2 Bootstrap Sampling 


We have already discussed the bootstrap sampling concepts in Chapter 3. We are just 
extending the idea to our problem here. Based on random samples drawn from our data, we will 
try to estimate the model and see if we can reduce the error and get a high-performance 
model. When we use these techniques for performance evaluation, we have 
already fixed the model, i.e., the predictors, and are trying to see probabilistically 
what gives the best performance and by how much. 

To show the bootstrap example, we will extend what we showed for cross 
validation. 


Create the boot experiment, let's take the number of samples as 10 (i.e., 10-sample 
bootstrap) 


boot_10s <- trainControl(method = "boot", number = 10, savePredictions = TRUE) 

Fit the model on bootstraps and use RMSE as the metric to select the model 

model_fitted <- train(HousePrice ~ StoreArea + StreetHouseFront + BasementArea + LawnArea + Rating + SaleType, data = train, family = identity, trControl = boot_10s, tuneLength = 5) 

Display the summary of the bootstrapped model 


model_fitted 
Random Forest 

712 samples 
  6 predictor 

No pre-processing 
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 712, 712, 712, 712, 712, 712, ... 
Resampling results across tuning parameters: 

  mtry  RMSE      Rsquared 
   2    40865.52  0.7778754 
   4    38474.68  0.7871019 
   6    38818.70  0.7819608 
   8    39540.90  0.7742633 
  10    40130.45  0.7681462 

RMSE was used to select the optimal model using the smallest value. 
The final value used for the model was mtry = 4. 


In the bootstrapped case, you can see that the best model has an R² of 79%, 
which is still higher than the 72% in the previous case but less than that of the 10-fold cross 
validation. One important thing to note is that bootstrap samples are drawn with 
replacement, so the same observations can be used again and again for model estimation, 
whereas cross validation maintains the exclusivity of the subsets in each run. 

The probabilistic methods are complex and difficult to understand. It is 
recommended that only experienced data scientists use them, as an in-depth 
understanding of the machine learning algorithm is required to set up these experiments 
and interpret them properly. The next chapter on parameter tuning is an extension of the 
probabilistic techniques that we discussed here. 
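
For readers who want to see the mechanics, the bootstrap evaluation loop can be sketched in base R. As with any sketch, the details are assumptions: the built-in mtcars data and a linear model stand in for the house price data and the random forest, and each model is scored on the rows that were left out of its bootstrap sample (the out-of-bag rows).

```r
set.seed(917)
data   <- mtcars   # built-in dataset, used here only as a stand-in
n_boot <- 10       # number of bootstrap samples

boot_rmse <- sapply(seq_len(n_boot), function(b) {
  # Draw a sample of the same size as the data, with replacement
  idx <- sample(nrow(data), replace = TRUE)
  # Rows never drawn into this sample serve as the test set
  oob <- setdiff(seq_len(nrow(data)), idx)

  fit  <- lm(mpg ~ wt + hp, data = data[idx, ])
  pred <- predict(fit, newdata = data[oob, ])
  sqrt(mean((data$mpg[oob] - pred)^2))
})

print(round(boot_rmse, 3))
print(round(mean(boot_rmse), 3))   # bootstrap estimate of the RMSE
```

Unlike the folds in cross validation, the bootstrap samples overlap with each other, which is the difference noted above.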


7.8 The Kappa Error Metric 


In recent days, machine learning practitioners are trying a lot of new and complicated 
error metrics for evaluation as well as model creation. These new error metrics are 
important, as they solve for some specific business problems/objectives. With high 
computing power, we can frame our own optimization function and apply an iterative 
algorithm to the data. 

Kappa, or Cohen's kappa coefficient, is a statistic that measures the relationship 
between observed accuracy and expected accuracy. Jacob Cohen introduced Kappa in 
a paper published in the journal Educational and Psychological Measurement in 1960. 
A similar statistic, called Pi, was proposed by Scott (1955). Cohen's Kappa and Scott's Pi 
differ in terms of how the expected probability is calculated. The method found its first 
use case in inter-rater agreement, with different raters rating the same cases into different 
buckets. 

In the machine learning world, Kappa is adopted to compare a model with pure random 
chance. This type of metric is very effective in cases of imbalanced 
classification. For example, suppose your training data has 80% "Yes" and 20% "No". 
Without a model, you can still achieve 80% accuracy in classification (diagonal) if 
you simply assign everyone a "Yes". 

A more formal definition of Kappa is given here. 

Cohen's Kappa measures the agreement between two approaches (in our setting, the 
model's predictions and the actual classes), each of which classifies N items into C 
mutually exclusive categories, after correcting for the agreement expected by chance. 
The equation for $\kappa$ is: 


$$\kappa = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e}$$ 


where $p_o$ is the relative observed agreement between the two approaches, and $p_e$ is the 
hypothetical probability of a chance overlap, using the observed data to calculate the 
probabilities of each approach randomly selecting each category. If the approaches are 
in complete agreement then $\kappa = 1$. If there is no agreement between the approaches other 
than what would be expected by chance (as given by $p_e$), then $\kappa = 0$. 
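
As a worked illustration of the formula, take the 80% "Yes" / 20% "No" example mentioned earlier and a classifier that assigns everyone a "Yes". The numbers below follow only from that assumed setup. The classifier is right for every actual "Yes", so the observed agreement is 0.8. The chance agreement multiplies, per category, the share predicted by the share actually observed.

```latex
\begin{aligned}
p_o &= 0.8 \\
p_e &= \underbrace{(1.0)(0.8)}_{\text{Yes}} + \underbrace{(0.0)(0.2)}_{\text{No}} = 0.8 \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{0.8 - 0.8}{1 - 0.8} = 0
\end{aligned}
```

The accuracy of 80% looks respectable, but Kappa is 0 because the classifier does no better than what chance alone would produce on this imbalanced data.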

For more detailed reading, refer to Fleiss, J. L. (1981), Statistical Methods for Rates 
and Proportions, 2nd ed. (New York: John Wiley) and Banerjee, M.; Capozzoli, Michelle; 
McSweeney, Laura; Sinha, Debajyoti (1999), "Beyond Kappa: A Review of Interrater 
Agreement Measures," The Canadian Journal of Statistics. 

We will use the purchase prediction data with a very simple model to illustrate the 
Kappa and accuracy measures. The caret package is used to show this example. This 
package provides a unified way of training and evaluating almost 270 different kinds of 
models. The details of this package are provided in Chapter 8. 


library(caret) 
library(mlbench) 


Below we randomly sample 5000 cases to make the computation faster. 

set.seed(917); 
train_kappa <- Data_Purchase_Model[sample(nrow(Data_Purchase_Model), size = 5000, replace = TRUE, prob = NULL), ] 


The train() function gets confused by numeric class levels, hence convert the 
dependent variable into text, i.e., 1->A, 2->B, 3->C, and 4->D 


train_kappa$ProductChoice_multi <- ifelse(train_kappa$ProductChoice == 1, "A", 
    ifelse(train_kappa$ProductChoice == 2, "B", 
        ifelse(train_kappa$ProductChoice == 3, "C", "D"))); 

train_kappa <- na.omit(train_kappa) 

Set the experiment 

cntrl <- trainControl(method = "cv", number = 5, classProbs = TRUE) 

The distribution below shows the number of cases for each product choice 


Distribution of ProductChoices 


table(train_kappa$ProductChoice_multi) 

   A    B    C    D 
1271 1244 1260 1225 

Making the column names legitimate R names 

colnames(train_kappa) <- make.names(names(train_kappa), unique = TRUE, allow_ = TRUE) 

Convert all the categorical variables into factors in R 


train_kappa$ProductChoice_multi <- as.factor(train_kappa$ProductChoice_multi) 
train_kappa$CustomerPropensity <- as.factor(train_kappa$CustomerPropensity) 
train_kappa$LastPurchaseDuration <- as.factor(train_kappa$LastPurchaseDuration) 


Now, the following code will create a random forest model for our sample 
data. Fit the model with the method set to random forest ("rf"). 


model_fitted <- train(ProductChoice_multi ~ CustomerPropensity + LastPurchaseDuration, data = train_kappa, method = "rf", metric = "Accuracy", trControl = cntrl) 


The result displays the Kappa metric 


print(model_fitted) 
Random Forest 

5000 samples 
   2 predictor 
   4 classes: 'A', 'B', 'C', 'D' 

No pre-processing 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4000, 3999, 4000, 4001, 4000 
Resampling results across tuning parameters: 

  mtry  Accuracy   Kappa 
   2    0.3288009  0.1036580 
  10    0.3274019  0.1024999 
  19    0.3268065  0.1017419 

Accuracy was used to select the optimal model using the largest value. 
The final value used for the model was mtry = 2. 


Create the predicted values and show them in a classification matrix 


pred <- predict(model_fitted, newdata = train_kappa) 
confusionMatrix(data = pred, train_kappa$ProductChoice_multi) 
Confusion Matrix and Statistics 

          Reference 
Prediction   A   B   C   D 
         A 830 653 475 427 
         B  97 133 108  85 
         C 134 179 304 210 
         D 210 279 373 503 

Overall Statistics 

               Accuracy : 0.354 
                 95% CI : (0.3407, 0.3674) 
    No Information Rate : 0.2542 
    P-Value [Acc > NIR] : < 2.2e-16 

                  Kappa : 0.1377 
 Mcnemar's Test P-Value : < 2.2e-16 

Statistics by Class: 

                     Class: A Class: B Class: C Class: D 
Sensitivity            0.6530   0.1069   0.2413   0.4106 
Specificity            0.5830   0.9228   0.8602   0.7717 
Pos Pred Value         0.3480   0.3144   0.3676   0.3685 
Neg Pred Value         0.8314   0.7573   0.7709   0.8014 
Prevalence             0.2542   0.2488   0.2520   0.2450 
Detection Rate         0.1660   0.0266   0.0608   0.1006 
Detection Prevalence   0.4770   0.0846   0.1654   0.2730 
Balanced Accuracy      0.6180   0.5149   0.5507   0.5911 


From an interpretation point of view, the following guidelines can be used: 


• Poor agreement when kappa is 0.20 or less 
• Fair agreement when kappa is 0.20 to 0.40 
• Moderate agreement when kappa is 0.40 to 0.60 
• Good agreement when kappa is 0.60 to 0.80 
• Very good agreement when kappa is 0.80 to 1.00 


In this model output, the Kappa value is 0.1377, which falls in the poor-agreement band: 
the model's predictions agree with the actual classes only slightly more often than 
would be expected by chance. The accuracy measure tells the same story. At 35.4%, it is 
only about ten percentage points above the no-information rate of 25.4%, so our model 
did not do a good job in classification. We need more data and 
features to get a good model. 
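
The overall statistics reported by confusionMatrix() can be reproduced from the printed matrix alone. The sketch below re-enters that matrix by hand and applies the Kappa formula from this section; the variable names are chosen for the illustration.

```r
# Confusion matrix copied from the printed output (rows = predicted, columns = reference)
cm <- matrix(c(830, 653, 475, 427,
                97, 133, 108,  85,
               134, 179, 304, 210,
               210, 279, 373, 503),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:4], Reference = LETTERS[1:4]))

n   <- sum(cm)
p_o <- sum(diag(cm)) / n                      # observed agreement, i.e., accuracy
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (p_o - p_e) / (1 - p_e)

# No-information rate: accuracy of always predicting the most frequent actual class
nir <- max(colSums(cm)) / n

print(round(c(Accuracy = p_o, NIR = nir, Kappa = kappa_hat), 4))
```

The three values match the Accuracy, No Information Rate, and Kappa lines of the output, which makes it clear that Kappa is simply the accuracy rescaled against the chance-level agreement.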




7.9 Summary 


Model evaluation is a very intricate subject. This chapter just scratched the surface to get 
the reader started on the idea of model evaluation. The subject brings 
a lot of depth to the measures we use to evaluate performance. In this ever-changing 
analytics landscape, businesses are using models for different purposes, sometimes in 
custom ways, to model a problem and help make business decisions. This trend in industry 
has given rise to competing evaluation measures. 

To solve a business problem in a real setting, you have to optimize two different 
objective functions: 


• Statistical measures, the ones we discussed in this chapter 
• Business constraints, which are problem/business specific measures 


Let's try to understand these constraints on the model performance with an example. 
Suppose you have to build a model to classify customers into eight buckets. However, the 
cost of dealing with each bucket of customers is different. Serving a customer in bucket 8 
is 10 times more costly than serving someone from bucket 1. Similarly, the cost of 
each bucket varies with the bucket number and with some other factors. 

Now if the business decides to use a model to classify the customers into these 
classes, how will you evaluate the performance of the model? A pure statistical measure 
of performance might not fit the situation. How can we think about creating hybrid 
performance metrics, or a serial dependent matrix? The concept of evaluation is a 
very deep and fairly involved one. Data scientists have to come up with creative and 
statistically valid metrics to suit business problems. 
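
One simple way to fold business cost into evaluation is to weight every cell of the classification matrix by what that outcome costs. The sketch below is purely hypothetical: the three-bucket matrix and the cost figures are invented for illustration and do not come from any dataset in this book.

```r
# Hypothetical classification matrix for three customer buckets
# (rows = actual bucket, columns = predicted bucket)
cmat <- matrix(c(50,  5,  5,
                 10, 30, 10,
                  5,  5, 40),
               nrow = 3, byrow = TRUE,
               dimnames = list(Actual = 1:3, Predicted = 1:3))

# Hypothetical cost of each (actual, predicted) outcome; correct predictions cost nothing,
# and misclassifying a bucket 3 customer is the most expensive mistake
cost <- matrix(c( 0, 1, 2,
                  1, 0, 1,
                 10, 5, 0),
               nrow = 3, byrow = TRUE,
               dimnames = dimnames(cmat))

accuracy <- sum(diag(cmat)) / sum(cmat)     # pure statistical measure
avg_cost <- sum(cmat * cost) / sum(cmat)    # average business cost per customer

cat("Accuracy:", accuracy, " Average cost per case:", avg_cost, "\n")
```

Two models with the same accuracy can have very different average costs, which is why a business-specific measure of this kind may rank them differently from a purely statistical one.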

This chapter introduced the concept of population stability index, which confirms 
if we can use use the model for prediction. Then we classified our evaluation metrics 
into continuous and discrete cases. The continuous metrics discussed were different 
functions of residuals, i.e., mean absolute error, root mean square error, and R-square. 
The discrete set of measures included classification rate, sensitivity and specificity, and 
area under the ROC curve. We used our house price data and purchase prediction data to 
show evaluation metrics examples. 

These evaluation techniques are more suited to statistical learning models.
The advanced machine learning models do not have any distribution constraints
and cannot be evaluated and interpreted on conventional metrics. We introduced
resampling methods to evaluate machine learning models, i.e., cross validation and
bootstrap sampling. These two methods form the backbone of machine learning model
performance evaluation.

In the end we discussed an important metric for multi-class problems, the 
Kappa metric. This metric is gaining in popularity as, in classification problems, each 
misclassification has a different cost associated with it. Hence, we need to measure 
performance in a relative manner. 

The model performance and evaluation techniques are evolving quickly. The
performance metrics are becoming multigoal optimization problems and hence are also
helping the algorithms adapt to new ways to fit data. We will continue with some more
advanced topics in the next chapter, where we will introduce the difference between
statistical learning and machine learning, including how this difference allows us to
do more with the data, and then how to go about improving the model performance
using ensemble techniques. The next chapter also introduces the tradeoff between bias and
variance, to help us understand the limits of what can be achieved in performance with
given constraints.

CHAPTER 8


Model Performance Improvement






Model performance is a broad term generally used to measure how the model performs 
on a new dataset, usually a test dataset. The performance metrics also play the role of 
thresholds to decide whether the model can be put into actual decision making systems 
or needs improvements. In the previous chapter, we discussed some performance metrics 
for our continuous and discrete cases. In this chapter, we will discuss how changing the 
modeling process can help us improve model performance on the metrics. 

Feature selection plays an important role in the model development process. It is the
features that carry the information to explain the dependent variable. Data scientists spend
a lot of time selecting and creating features for fitting predictive models. The feature
engineering process involves selecting the best set of features and their transformations.
These sets of features are then fed into an algorithm to quantify the relationships. The
algorithm learns from the data and creates a predictive model. The performance of such
a model is then evaluated based on some kind of error measure. Model performance
improvement methods are then applied to boost the performance on the error metrics of
interest. The higher-level properties of a model, e.g., complexity and speed of learning,
also impact the model performance. These high-level parameters are known as
hyper-parameters. We will discuss hyper-parameters more in the following sections. Broadly,
there are two ways to improve the model performance, specifically in machine learning
algorithms:


• Add more features and improve the quality of data
• Optimize the hyper-parameters


The first point is what we have been discussing so far in the book. However, we
also discussed some algorithms where the learning process is influenced by
hyper-parameters, e.g., the depth of the tree in decision trees, the number of folds in cross
validation, etc. These parameters are independent of the features and influence the
model performance. For instance, you can have two different decision tree models using
the same set of predictors but different hyper-parameters to train them. To understand
the performance optimization process, we need to understand the tradeoff between bias
and variance. Bias refers to the difference between the true values and the average
predicted values, while variance refers to the spread of the predicted values around their
mean. Bias and variance are the two vital components of imprecision/performance in
predictive models, and in general there is a tradeoff between them: reducing one normally
leads to increasing the other, and the relationship between the two is nonlinear.
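
A small simulation can make the two terms concrete. The sketch below is an illustration under assumed settings, not part of the book's datasets: the "true" function, the noise level, the sample size, and the point x0 are all invented. It repeatedly refits polynomials of different degree and measures the squared bias and the variance of the prediction at x0.

```r
set.seed(917)

true_f <- function(x) sin(2 * pi * x)   # assumed "true" relationship
x0     <- 0.25                          # point at which the predictions are studied
n_sims <- 200                           # number of simulated training sets

simulate_fit <- function(degree) {
  # Repeatedly draw a training set, fit a polynomial, and predict at x0.
  preds <- replicate(n_sims, {
    x <- runif(30)
    y <- true_f(x) + rnorm(30, sd = 0.5)
    fit <- lm(y ~ poly(x, degree))
    unname(predict(fit, newdata = data.frame(x = x0)))
  })
  c(degree   = degree,
    bias_sq  = (mean(preds) - true_f(x0))^2,  # squared gap between average prediction and truth
    variance = var(preds))                    # spread of predictions around their mean
}

results <- t(sapply(c(1, 3, 9), simulate_fit))
print(round(results, 4))
```

In runs like this one, the simplest model would typically show the largest squared bias, while the most flexible model tends to show the largest variance; the exact numbers depend on the assumed settings.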
This chapter will look at these issues and provide illustrations in R to equip you
to implement some of the popular performance improvement techniques using R.
The content of this chapter is oriented toward broader awareness of the latest
developments in the computational world due to increased computational power and
business acceptance of concepts. The datasets for this chapter are the same as in the previous
chapter (purchase prediction and house sale price), so that you can see how the concepts
from this chapter change the results on the metrics from the previous chapter.


Note The R illustrations in this chapter are computationally heavy, so you are advised to 
check the machine configuration before running these examples. 


While we try to balance simplicity and completeness in this chapter, we expect
the user of these techniques to have a good understanding of numerical computing and
machine learning algorithms. References to research papers will be shared for detailed
reading on statistical underpinnings.

Learning objectives for this chapter: 


• Machine learning and statistical modeling
• Overview of Caret package
• Introduction to hyper-parameters
• Hyper-parameter tuning illustrations
• Bias versus variance tradeoffs
• Introduction to ensemble learning
• Advanced methods in ensemble learning
• Advanced topic: Bayesian optimization


8.1 Machine Learning and Statistical Modeling 


The comparison of machine learning and statistical modeling has been a key debate topic
in recent times. Machine learning has become a very popular term, and its popularity keeps
growing as computational power increases. In this section, we try to
express an opinion based on some of the leading arguments in this debate. The core of
this debate does not divide machine learning and statistics into two exclusive groups, but
it will make you more aware of how you can solve a problem with data.

At the core of machine learning/statistical modeling is quantifying the relationship
between the response variable and predictors. Mathematically, the relationship can be
written as a function:


Y = f(X) + ε

Where

f(): Function of X

X: An input vector with components X1, X2, ..., Xn

Y: Output

ε: The random error


The way we treat the estimation problem is what differentiates machine learning
from statistical modeling. Machine learning uses algorithms that can learn this
relationship without relying on any rule-based programming. Statistical modeling will
estimate the relationship based on formal quantification from statistical inference
(confidence intervals, hypothesis testing, distributions, etc.). The process of statistical
inference quantifies the process by which data is generated, while machine learning will
emphasize how the final predictions will look if similar data is supplied in the future.
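
The two emphases can be seen on the same fitted model. The following sketch uses simulated data from an assumed relationship Y = 2 + 3X + ε (the coefficients, noise level, and split sizes are invented for the example): the statistical view inspects confidence intervals and tests for the coefficients, while the machine learning view checks the prediction error on data not used for fitting.

```r
set.seed(917)

# Simulated data from an assumed relationship Y = 2 + 3 * X + e.
n   <- 200
x   <- runif(n)
y   <- 2 + 3 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Hold out 50 observations for testing.
train_idx <- sample(n, 150)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit <- lm(y ~ x, data = train)

# Statistical modeling view: inference about the data-generating process.
print(confint(fit, level = 0.95))      # confidence intervals for the coefficients
print(summary(fit)$coefficients)       # estimates, standard errors, t-tests

# Machine learning view: how well do predictions hold up on unseen data?
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
print(rmse)
```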

Statistical learning terminology also differs from machine learning terminology. For
instance, we say estimation in statistics but learning in machine learning. There are numerous
other cases where the terminology is different because the two fields arrived at the same
objective from different origins. Here are a few more examples:


• Classifier -> Hypothesis
• Regression/Classification -> Supervised Learning
• Clustering -> Unsupervised Learning


Robert Tibshirani, a statistician and machine learning expert at Stanford, says 
machine learning is a glamorous version of statistics. Although statistical analysis and 
methodology is the predominant approach in modern machine learning, not all machine 
learning methods are based on probabilistic models, e.g., SVMs and non-negative matrix 
factorization. 

Machine learning is also computationally costly and needs more computing power,
which in turn helps in solving many complex problems. One more difference is the size of
data normally seen in these two fields: statistics usually deals with low dimensional spaces,
while machine learning is used in higher dimensional spaces. When we have hundreds
of features and millions of data points, upholding statistical principles becomes
impractical. In such situations, we employ techniques that are based on scalable learning
methods that make fewer assumptions.

The machine learning tools and techniques are capable of learning from trillions of
observations one by one. They make predictions and learn simultaneously. Algorithms
like random forest and gradient boosting are exceptionally robust and fast on data with
wide variety (high dimension/many features) and depth (a high number of observations).
Statistical modeling, however, is generally applied to smaller datasets with fewer attributes;
otherwise the models end up overfitting. Also, machine learning methods are spared from
the assumptions that are required in statistical learning. Machine learning algorithms in
general can be used with any distribution and/or with any boundary conditions to train a model.

The best analogy so far comes from nature and the way humans learn. We don't
learn things around us based on assumptions but learn from trials. Similarly, machine
learning is an adaptation of learning from multiple iterations, in which each iteration
tries to get closer to the actual values. As the guiding principle for machine learning is
to replicate a system, its predictive power is generally very strong. This allows us to put in
all the variables before knowing their relation to the response variable, and the algorithm
takes care of any misfit variable. Statistical models, however, are mathematics-intensive
and based on coefficient estimation. They require the modeler to understand the
relationship between variables before putting them in.

In a nutshell, machine learning is not as deterministic as statistical modeling. With
the scope of learning, how we ask our machine learning algorithm to learn from data
becomes very important. We can actually influence the model performance by managing the rules
of how the machine should learn, whereas for statistical models the options are limited to
inputs and preset assumptions for the statistical method.


8.2 Overview of the Caret Package 


The Caret package is one of the most powerful packages in R. This package allows
users to explore the machine learning algorithms to their fullest potential. The Caret
package (short for classification and regression training) contains functions for complex
regression and classification problems. The package has a dedicated GitHub page and is one
of the actively updated and documented packages of R. The Caret package was created and is
maintained by Max Kuhn from Pfizer.

The Caret package has a lot of dependencies on other R packages. The required
packages are only loaded when needed, which saves a lot of overhead time and
computational power. For instance, the randomForest library is loaded only if you use rf as
one of the model methods. You can install this package with or without the dependencies.
You can install it together with all 27 dependent packages by including the Suggests field
in the dependencies argument; otherwise Caret loads packages as needed and assumes that
they are installed.


install.packages("caret", dependencies = c("Depends", "Suggests"))


You are encouraged to visit the Caret project page and get the latest information
from there. The home page of the project is at http://caret.r-forge.r-project.org/
and the GitHub page can be accessed at http://topepo.github.io/caret/index.html.

The Caret package has numerous functions for model development and evaluation
metrics for performance measurement. Being a comprehensive package, it can also be used
for sampling techniques and for sophisticated feature selection processes.
The two most important functions/tools in the Caret package are:


• trainControl()
• train()


The trainControl() function is like a wrapper that defines the rules for model
training and the conditions around how sampling and grid search are to be done. The
train() function is a very powerful function that supports the 230 types of models available
in the Caret package. The primary function/tool, train(), can be used for:


• Model evaluation, using cross validation, resampling, and other
  conventional metrics. It also can be used to measure the effect of
  tuning parameters on performance.

• Model selection by choosing the best model based on optimal
  parameters, so multiple metrics can be calculated to choose the
  final model.

• Model estimation using any of the 230 types of models listed in
  the train model list with default parameters or tuned ones.


By default, the function automatically chooses the tuning parameters associated
with the best value of the performance metric, although different algorithms can be used to
tune the parameters. Figure 8-1 outlines the procedure
(Source: http://topepo.github.io/caret/model-training-and-tuning.html).


Define sets of model parameter values to evaluate
for each parameter set do
    for each resampling iteration do
        Hold out specific samples
        [Optional] Pre-process the data
        Fit the model on the remainder
        Predict the hold-out samples
    end
    Calculate the average performance across hold-out predictions
end
Determine the optimal parameter set
Fit the final model to all the training data using the optimal parameter set


Figure 8-1. The train() function algorithm in the Caret package 
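
To make the procedure in Figure 8-1 concrete, here is a minimal base R sketch of the same loop. It is not the Caret implementation: the toy data, the choice of polynomial degree as the tuning parameter, the 5-fold split, and all object names are assumptions made for illustration.

```r
set.seed(917)

# Toy regression data (assumed for illustration).
n   <- 120
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- dat$x^3 - dat$x + rnorm(n, sd = 0.5)

# Define sets of model parameter values to evaluate
# (here the tuning parameter is the polynomial degree).
param_grid <- data.frame(degree = 1:6)

# Assign each row to one of k folds.
k       <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# For each parameter set ...
cv_rmse <- sapply(param_grid$degree, function(d) {
  # ... and for each resampling iteration:
  fold_err <- sapply(seq_len(k), function(f) {
    hold_out  <- dat[fold_id == f, ]                 # hold out specific samples
    remainder <- dat[fold_id != f, ]
    fit  <- lm(y ~ poly(x, d), data = remainder)     # fit the model on the remainder
    pred <- predict(fit, newdata = hold_out)         # predict the hold-out samples
    rmse(hold_out$y, pred)
  })
  mean(fold_err)                                     # average performance across folds
})

results <- cbind(param_grid, RMSE = cv_rmse)
print(results)

# Determine the optimal parameter set.
best_degree <- results$degree[which.min(results$RMSE)]

# Fit the final model to all the training data using the optimal parameter set.
final_fit <- lm(y ~ poly(x, best_degree), data = dat)
```

The train() function wraps this kind of loop behind a single call, with trainControl() deciding how the hold-out samples are formed.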


In general, the basic use of the Caret package includes first defining trainControl()
and then calling the train() function. Here we show the generic syntax of calling these
two functions in order to use the Caret functionality.

The trainControl() function can be used to specify the type of resampling. By default, a
simple bootstrap resampling is used. Others are available, such as repeated K-fold cross
validation, leave-one-out, etc. The following example sets up repeated 10-fold cross validation:


rfControl <- trainControl(
  method = "repeatedcv",  # Repeated K-fold cross validation; others, such as
                          # leave-one-out, are available
  number = 10,            # Number of folds
  repeats = 10            # Repeated ten times
)


In the non-formula interface, the first two arguments to train() are the predictor and
outcome data objects, respectively; the example here uses the formula interface instead,
passing a model formula and a data argument. The method argument specifies the type of
model (see train model list or train models by tag). Here is an example that fits a
random forest model via the randomForest package, evaluated with the repeated 10-fold
cross validation defined in rfControl:


set.seed(917)

randomForestFit1 <- train(Class ~ .,             # Define the model equation
                          data = training,       # Define the modeling data
                          method = "rf",         # Model to use; caret provides the list of
                                                 # options in the train model list
                          trControl = rfControl  # Conditions that control the training
                          )                      # Other options specific to the modeling
                                                 # technique can also be passed

randomForestFit1


More information about trainControl is given in a later section. Details can be
found at http://topepo.github.io/caret/model-training-and-tuning.html.

As this is a core machine learning package in R, it deals with almost all of the machine
learning techniques. Therefore, it’s important to keep its functionality in mind.


Note In this chapter we will not be using the full dataset. The illustrations will be on
a smaller set of data to make sure you can replicate the results on less powerful machines.


8.3 Introduction to Hyper-Parameters 


In machine learning, we deal with two kinds of parameters: the standard
model parameters and the hyper-parameters. The core difference between
these two types of parameters is that model parameters can be directly learned from the
underlying data and hyper-parameters cannot. The machine learning model training
process learns from the data and fits the model parameters.

The hyper-parameters, however, are not directly learned from the data, yet they are
very influential in model performance. Hyper-parameters describe the "higher-level"
properties of the model, such as its complexity, how fast it should learn, and how
much depth it should go into. Another important point is that hyper-parameters are
fixed before training starts; the standard model parameters are then learned under
those settings. We can say that hyper-parameters decide the rules of model training
by which the standard model parameters are estimated.

Now, how are the hyper-parameters decided? What influences the hyper-parameter
selection process? This area is summed up as hyper-parameter optimization and will be
touched upon at a high level in an upcoming section.

The hyper-parameters differ from the standard model parameters (or coefficients).
Some of the properties of hyper-parameters are listed here:


• Explain higher-level properties: Define the complexity of the model,
  capacity to learn, optimization criteria, etc.

• Not directly learned: They cannot be learned from underlying
  data, like the standard model parameters can. They are a
  property of the machine learning algorithm and the learning space
  and need to be predefined.

• Iterative optimization: They can be set at different values and then
  evaluated on model performance, so in the most primitive sense,
  they can be optimized by iteratively finding the value that tests
  better.


Another way to look at hyper-parameters is as a prerequisite for a Bayesian approach
to statistical learning, which involves finding the probability distribution of the model
parameters given a training dataset. For instance, training an artificial neural network
involves four decisions: selecting the model type and algorithm, selecting the architecture
of the network, assigning the training parameters, and learning the model parameters.
The first three are preset hyper-parameters, while the last is learned from the data.
Generally, we can organize these into four decision points, the first three of which are
made before we train the model with data:


• Model type: Decide what type of model you choose in machine
  learning, like feed-forward or recurrent neural network, support
  vector machine, linear regression, etc.

• Architecture: Once you decide the model type, you give inputs
  on what the boundaries of the model learning process are, i.e.,
  number of hidden layers, number of nodes per hidden layer,
  batch normalization and pooling layer, etc.

• Training parameters: Once you decide on the model type and
  architecture, you decide how the model should learn, i.e.,
  learning and momentum rate, batch size, etc. These parameters
  are sometimes called training parameters.

• Model parameters: Once you provide these inputs, the model
  training process starts and the model parameters are estimated,
  such as weights and biases in a neural network.


Some examples of hyper-parameters are: 


• Depth of trees or number of leaves
• Latent factors in a matrix factorization
• Learning rate (in neural network based methods)
• Hidden layers in a deep neural network
• Number of clusters in a k-means clustering
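
The last example in the list is easy to try in base R. In the sketch below, which uses invented toy data with three assumed groups, the number of clusters passed as centers is the hyper-parameter fixed before fitting, while the centroid coordinates returned by kmeans() are the model parameters learned from the data.

```r
set.seed(917)

# Toy data: three assumed groups in two dimensions (50 points each).
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2)
)

# centers (the number of clusters) is the hyper-parameter: fixed before fitting.
# The centroid coordinates are the model parameters: learned from the data.
fits <- lapply(2:4, function(k) kmeans(pts, centers = k, nstart = 10))

for (fit in fits) {
  cat("k =", nrow(fit$centers),
      " total within-cluster SS =", round(fit$tot.withinss, 2), "\n")
}

# Learned centroids for the k = 3 fit.
print(fits[[2]]$centers)
```

Changing k changes the fitted centroids and the within-cluster sum of squares even though the data stay the same, which is the sense in which a hyper-parameter shapes what the model learns.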




To illustrate the effect of hyper-parameters on the model performance, we will create
an example with different hyper-parameters and check the performance of the model. For
this example, we will use a subset of the purchase prediction data.

In the following example, we create two random forest models with the same
underlying data and the same predictor variables, but with two different values for the
hyper-parameter (the number of trees):


• ntree = 20
• ntree = 50


First, we load the data, remove the missing values, and draw a sample of 10,000 records:


setwd("C:/Personal/Machine Learning/Run Chap 8")

library(caret)
library(randomForest)

set.seed(917)

# Load Dataset
Purchase_Data <- read.csv("Purchase Prediction Dataset.csv", header = TRUE)

# Remove the missing values
data <- na.omit(Purchase_Data)

# Pick a sample of records
Data <- data[sample(nrow(data), size = 10000), ]


• Model 1: with tree size = 20


The following code fits the random forest algorithm using 20 trees.


fit_20 <- randomForest(factor(ProductChoice) ~ MembershipPoints + CustomerAge +
                         PurchaseTenure + CustomerPropensity + LastPurchaseDuration,
                       data = Data,
                       importance = TRUE,
                       ntree = 20)

# Print the result for ntree = 20
print(fit_20)


Here are the results for the algorithm using 20 trees in the random forest algorithm. 


Call:
 randomForest(formula = factor(ProductChoice) ~ MembershipPoints
     + CustomerAge + PurchaseTenure + CustomerPropensity
     + LastPurchaseDuration, data = Data, importance = TRUE, ntree = 20)
               Type of random forest: classification
                     Number of trees: 20
No. of variables tried at each split: 2

        OOB estimate of error rate: 64.27%
Confusion matrix:
    1    2    3   4 class.error
1 550 1035  495 104   0.7481685
2 730 1927 1051 199   0.5067827
3 449 1300 1005 165   0.6557040
4 149  450  300  91   0.9080808


• Model 2: with tree size = 50


The following code fits the random forest algorithm using 50 trees, followed by its results.


fit_50 <- randomForest(factor(ProductChoice) ~ MembershipPoints + CustomerAge +
                         PurchaseTenure + CustomerPropensity + LastPurchaseDuration,
                       data = Data,
                       importance = TRUE,
                       ntree = 50)

# Print the result for ntree = 50
print(fit_50)


Call:
 randomForest(formula = factor(ProductChoice) ~ MembershipPoints
     + CustomerAge + PurchaseTenure + CustomerPropensity
     + LastPurchaseDuration, data = Data, importance = TRUE, ntree = 50)
               Type of random forest: classification
                     Number of trees: 50
No. of variables tried at each split: 2

        OOB estimate of error rate: 63.35%
Confusion matrix:
    1    2    3   4 class.error
1 502 1153  472  57   0.7701465
2 712 2065  994 136   0.4714615
3 427 1329 1029 134   0.6474820
4 147  467  307  69   0.9303030


We can see that just changing the hyper-parameters changes the results.
The overall error rate with ntree=50 has come down to 63.35% from 64.27%. Looking at
the per-class errors, classes 2 and 3 are classified better with ntree=50 (class.error
falls by about 3.5 and 0.8 percentage points, respectively), while classes 1 and 4 are
classified worse (class.error rises by about 2.2 percentage points each). Now the next
important question to answer is what is the most cost- and time-effective way to find an
optimal value of the hyper-parameters.
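
The error rates quoted above can be recomputed directly from the two printed confusion matrices. The sketch below copies the counts from the outputs; the helper names error_rate and class_error are introduced here only for illustration.

```r
# Confusion matrices copied from the two printed outputs
# (rows = actual class, columns = predicted class).
cm_20 <- matrix(c(550, 1035,  495, 104,
                  730, 1927, 1051, 199,
                  449, 1300, 1005, 165,
                  149,  450,  300,  91),
                nrow = 4, byrow = TRUE, dimnames = list(1:4, 1:4))
cm_50 <- matrix(c(502, 1153,  472,  57,
                  712, 2065,  994, 136,
                  427, 1329, 1029, 134,
                  147,  467,  307,  69),
                nrow = 4, byrow = TRUE, dimnames = list(1:4, 1:4))

# Overall error rate: share of observations off the diagonal.
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Per-class error: share of each actual class (row) that was misclassified.
class_error <- function(cm) 1 - diag(cm) / rowSums(cm)

# Overall error rates in percent for the two models.
print(round(100 * c(ntree_20 = error_rate(cm_20), ntree_50 = error_rate(cm_50)), 2))

# Change in per-class error when moving from 20 to 50 trees
# (negative = the class is classified better with 50 trees).
print(round(class_error(cm_50) - class_error(cm_20), 4))
```

The first line of output reproduces the 64.27% and 63.35% error rates, and the sign of each entry in the second line shows which classes gained or lost from the larger forest.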


8.4 Hyper-Parameter Optimization


In machine learning, hyper-parameter optimization or model selection is the process of
choosing a set of hyper-parameters for a machine learning algorithm. The set of
hyper-parameters that maximizes the model performance is then chosen for actual model
training and testing. Cross validation is generally used for measuring the performance of
the model in terms of the cross validation error rate or some other user-defined measure, e.g.,
bootstrap error, leave-one-out error, etc.

In short, learning algorithms learn model parameters that model/fit the input
data well, while hyper-parameter optimization ensures that the model does not overfit
its data, e.g., by tuning regularization. Multiple algorithms have been suggested to
optimize the hyper-parameters of any algorithm, and multiple popular packages
and paid services are also available to optimize the parameters. Most of them are based on
one variation or another of the Bayesian approach. We will illustrate parameter
tuning by different methods on the same data and set of predictors. This will help you get
a comparative understanding of how the results change and what can be influencing them.

The most popular methods are listed here, with some context. We are not providing 
any direct comparison of these methods as the selection of method is influenced by 
many factors, including but not limited to type of model, computation power, time-space 
complexity, etc. 


• Manual search: Create a set of parameters using best judgment/
  experience and test them on the model. Choose the one that
  works best for the model performance.

• Manual grid search: Create an equally spaced grid or custom grid
  of a combination of hyper-parameters. Evaluate the model on each
  grid point and choose the ones with the best model performance.

• Automatic grid search: Let the program decide a grid for you and
  do the search in that space for the best hyper-parameters.

• Optimal search: In this method we generally don't freeze the grid
  beforehand, but allow the machine to expand the grid as and
  when needed.

• Random search: Choosing some random points in the
  hyper-parameter search space often finds good values faster than
  an exhaustive grid. Although this saves a lot of space and time cost,
  it might not always give you the best/optimal set of hyper-parameters.

• Custom search: Users can define their own functions and guide
  the algorithm on how to find the best set of hyper-parameters.
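
The difference between a grid search and a random search can be sketched on a toy problem. In the example below, val_error is a hypothetical stand-in for a cross-validated error over two hyper-parameters a and b; in practice each evaluation would involve fitting and validating a real model.

```r
set.seed(917)

# Hypothetical validation-error surface over two hyper-parameters.
# In practice this would be a cross-validated error from a real model.
val_error <- function(a, b) (a - 0.3)^2 + 2 * (b - 0.7)^2

# Manual grid search: evaluate every combination on an equally spaced grid.
grid <- expand.grid(a = seq(0, 1, by = 0.25), b = seq(0, 1, by = 0.25))
grid$error <- val_error(grid$a, grid$b)
best_grid <- grid[which.min(grid$error), ]

# Random search: evaluate the same number of randomly drawn points.
rand <- data.frame(a = runif(nrow(grid)), b = runif(nrow(grid)))
rand$error <- val_error(rand$a, rand$b)
best_rand <- rand[which.min(rand$error), ]

print(best_grid)
print(best_rand)
```

The grid can only return one of its predefined points, whereas the random search can land anywhere in the space; which of the two ends up closer to the optimum for a given budget depends on the surface and on the draw.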


Note Most of these parameter tuning/optimization techniques are search problems
in high dimensional space. The searching is done on an iterative and guided basis, mostly
numerically. The following sections illustrate the popular optimization methods.


8.4.1 Manual Search


The details of the model for manual search optimization are discussed in this section.

Response Variable: ProductChoice

Predictors: MembershipPoints, CustomerAge, PurchaseTenure, CustomerPropensity, and
LastPurchaseDuration

Error Calculation: Cross Validation

Model Type: Random Forest


# Manually search parameters
# load the packages
library(data.table)
library(randomForest)
library(mlbench)
library(caret)

# Load Dataset
dataset <- Data
metric <- "Accuracy"


Here, we set the trainControl function with method="repeatedcv", meaning
use repeated cross validation, and search="grid", meaning search in the grid
defined by tunegrid.


# Manual Search
trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                             search = "grid")
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(dataset) - 2)))
modellist <- list()


Here, we set the train function with method="rf", meaning use the random forest
algorithm to fit the model, and pass the number of trees as ntree from the loop variable.


for (ntree in c(100, 150, 200, 250)) {
  set.seed(917)
  fit <- train(factor(ProductChoice) ~ MembershipPoints + CustomerAge +
                 PurchaseTenure + CustomerPropensity + LastPurchaseDuration,
               data = dataset, method = "rf", metric = metric,
               tuneGrid = tunegrid, trControl = trainControl, ntree = ntree)
  key <- toString(ntree)
  modellist[[key]] <- fit
}

# compare results by resampling
results <- resamples(modellist)

# Summary of Results
summary(results)

Call:
summary.resamples(object = results)

Models: 100, 150, 200, 250
Number of resamples: 30

Accuracy
      Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
100 0.3880  0.3990 0.4107 0.4081  0.4134 0.4364    0
150 0.3890  0.3996 0.4117 0.4094  0.4147 0.4390    0
200 0.3864  0.3974 0.4095 0.4081  0.4139 0.4360    0
250 0.3884  0.4013 0.4097 0.4090  0.4167 0.4390    0

Kappa
       Min. 1st Qu.  Median    Mean 3rd Qu.   Max. NA's
100 0.04385 0.06549 0.08044 0.07803 0.08661 0.1209    0
150 0.05301 0.06481 0.08014 0.07977 0.08831 0.1235    0
200 0.04243 0.06214 0.07847 0.07757 0.08953 0.1196    0
250 0.04427 0.06311 0.08145 0.07873 0.08884 0.1244    0
# Dot Plot of results
dotplot(results)

[Figure: dot plots of Accuracy and Kappa for each model; confidence level: 0.95]

Figure 8-2. Performance plot accuracy metrics


You can see the accuracy doesn't vary much between the different parameter values;
the mean stays at about 0.41 for every value of ntree. This can mean that our search
is not comprehensive or that the model is able to learn most of the features of the data
with fewer than 100 trees. It also suggests that the list of independent
variables should be expanded.


8.4.2 Manual Grid Search 


The details of the model for manual grid search optimization are discussed in this section.

Response Variable: ProductChoice

Predictors: MembershipPoints, CustomerAge, PurchaseTenure, CustomerPropensity, and
LastPurchaseDuration

Error Calculation: Cross Validation

Model Type: Learning Vector Quantization (LVQ)


# Tune algorithm parameters using a manual grid search.
set.seed(917)
dataset <- Data


Here, we set the trainControl function with method="repeatedcv", meaning use
repeated cross validation. The search method is left at its default, "grid", and the grid
itself is defined by grid and passed to train through tuneGrid.


# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# design the parameter tuning grid
grid <- expand.grid(size = c(5, 10, 20, 50), k = c(1, 2, 3, 4, 5))

# train the model
model <- train(factor(ProductChoice) ~ MembershipPoints + CustomerAge +
                 PurchaseTenure + CustomerPropensity + LastPurchaseDuration,
               data = dataset, method = "lvq", trControl = control, tuneGrid = grid)

# summarize the model
print(model)
Learning Vector Quantization

10000 samples
    5 predictor
    4 classes: '1', '2', '3', '4'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 9001, 9000, 9000, 9001, 8999, 9000, ...
Resampling results across tuning parameters:

size k Accuracy Kappa
5 1 0.3403649 0.01508857

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 5 and k = 3.


# plot the effect of parameters on accuracy
plot(model)

Figure 8-3. Accuracy across cross-validated samples

The tuning algorithm reports the best tuning parameters. Figure 8-3 also shows the
top line peaking on the accuracy plot, which corresponds to the best model.
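
It helps to see how large the manual grid is before running it. The sketch below rebuilds the grid used in this section and counts the work involved, under the assumption that every candidate is refit on every resample and that one final model is fit on all the data, as in Figure 8-1; the object names are introduced only for this illustration.

```r
# The manual tuning grid from this section.
grid <- expand.grid(size = c(5, 10, 20, 50), k = c(1, 2, 3, 4, 5))

n_candidates <- nrow(grid)   # 4 values of size x 5 values of k
n_resamples  <- 10 * 3       # 10 folds, repeated 3 times

# Assumed cost: each candidate is fit once per resample,
# plus one final fit on all the data.
n_fits <- n_candidates * n_resamples + 1

print(head(grid))
print(c(candidates = n_candidates, resamples = n_resamples, fits = n_fits))
```

Under these assumptions the 20 candidates translate into roughly 600 model fits, which is why the size of the grid matters on less powerful machines.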


8.4.3 Automatic Grid Search 


The details of the model for automatic grid search optimization are discussed in this
section.

Response Variable: ProductChoice

Predictors: MembershipPoints, CustomerAge, PurchaseTenure, CustomerPropensity, and
LastPurchaseDuration

Error Calculation: Cross Validation

Model Type: Learning Vector Quantization (LVQ)


# Tune algorithm parameters using an automatic grid search. 
set.seed(917); 
dataset <-Data 


Here, we set the trainControl function with method="repeatedcv", meaning use
repeated cross validation, with the search method left at its default, i.e., an automatic grid
search. The tuneLength argument of train controls how many values of each parameter are tried.


# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# train the model
model <- train(factor(ProductChoice) ~ MembershipPoints + CustomerAge +
                 PurchaseTenure + CustomerPropensity + LastPurchaseDuration,
               data = dataset, method = "lvq", trControl = control, tuneLength = 5)

# summarize the model
print(model)
Learning Vector Quantization

10000 samples
    5 predictor
    4 classes: '1', '2', '3', '4'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 9000, 8999, 9001, 9001, 9000, 9000, ...
Resampling results across tuning parameters:

size k Accuracy Kappa

11 1 0.3402322 0.03666635 
11 6 0.3402335 0.03518447 
11 11 0.3394009 0.04093310 
11 16 0.3499678 0.04415707 
11 21 0.3444298 0.04208990 
13 1 0.3379881 0.03523337 
13 6 0.3459702 0.04826571 
13 11 0.3464008 0.05010497 
13 16 0.3467346 0.05055072 
13 21 0.3497683 0.05600358 
16 1 0.3313684 0.03813657 
16 6 0.3460655 0.05013518 
16 11 0.3417672 0.04646887 
16 16 0.3502685 0.04977277 
16 21 0.3456003 0.04755585 
19 1 0.3299696 0.03229510 
19 6 0.3392352 0.04576555 
19 11 0.3361026 0.03859754 
19 16 0.3479016 0.05067015 
19 21 0.3451598 0.04997000 
22 1 0.3365661 0.03454596 
22 6 0.3459982 0.04399154 
22 11 0.3441293 0.04592163 
22 16 0.3506335 0.05187679 
22 21 0.3512329 0.05437707 


Accuracy was used to select the optimal model using the largest value. 
The final values used for the model were size = 22 and k = 21. 


# plot the effect of parameters on accuracy
plot(model)

Figure 8-4. Accuracy across cross validated samples for an automatic grid search


The automatic grid search optimization shows the best model would be with
parameters of size=22 and k=21, which corresponds to an accuracy of 0.3512. This differs
from our manual grid search, where the optimal parameters were size=5 and k=3, with an
accuracy of 0.3581.


8.4.4 Optimal Search 


The details of the model for optimal search optimization are discussed in this section. 
Response Variable: ProductChoice 


Predictors: MembershipPoints, CustomerAge, 
PurchaseTenure, CustomerPropensity, and 
LastPurchaseDuration 


Error Calculation: Cross Validation 
Model Type: Recursive Partitioning and Regression Trees 


Observe the following three expand.grid calls we used for the tuneGrid parameter in
the train function.

• Manual search: expand.grid(.mtry=c(sqrt(ncol(dataset)-2)))
• Manual grid search: expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
• Optimal search: expand.grid(.cp=seq(0,0.1,by=0.01))

In the optimal search, the values passed to expand.grid are more granular, which
means the search is more likely to land close to the best value of the tuning parameter
than in the other two cases. For example, by reducing by = 0.01 in the seq function to a
smaller step such as 0.001, you can further increase the granularity. However, keep in mind
that a finer grid increases the computational effort, because every additional grid value
is another model to train and cross-validate.
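To make the cost of granularity concrete, the following sketch counts the candidate values in the cp grid used here and in a finer one. The step of 0.001 and the variable names are assumptions chosen for illustration; the 30 fits per candidate come from 10-fold cross validation repeated 3 times.

```r
# Grid used for the optimal search in this section
coarse <- expand.grid(.cp = seq(0, 0.1, by = 0.01))
# Finer grid (hypothetical step size, for illustration only)
fine <- expand.grid(.cp = seq(0, 0.1, by = 0.001))

# Every candidate value is refit once per fold and per repeat
fits_per_candidate <- 10 * 3
data.frame(grid = c("coarse", "fine"),
           candidates = c(nrow(coarse), nrow(fine)),
           model_fits = c(nrow(coarse), nrow(fine)) * fits_per_candidate)
```

Going from 11 to 101 candidates multiplies the number of resampled fits by roughly nine, which is the tradeoff to weigh before refining a grid.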


# Select the best tuning configuration 
dataset <-Data 


Here, we set the trainControl function with method="repeatedcv", meaning
use repeated cross validation, and parameter tuning is done on tunegrid.


# prepare training scheme 

control <-trainControl(method="repeatedcv", number=10, repeats=3) 

# CART 

set.seed(917); 

tunegrid <- expand.grid(.cp=seq(0,0.1,by=0.01))

fit.cart <- train(factor(ProductChoice) ~ MembershipPoints + CustomerAge
+ PurchaseTenure + CustomerPropensity + LastPurchaseDuration, data=dataset,
method="rpart", metric="Accuracy", tuneGrid=tunegrid, trControl=control)
Loading required package: rpart 


fit.cart 
CART 


10000 samples 
5 predictor 
4 classes: '1', '2', '3', '4'


No pre-processing 

Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 9000, 8999, 9001, 9001, 9000, 9000, ... 
Resampling results across tuning parameters: 


cp Accuracy Kappa 

0.00 0.3557312 0.05943192 
0.01 0.4014336 0.04179296 
0.02 0.3966989 0.02481739 
0.03 0.3907000 0.00000000 
0.04 0.3907000 0.00000000 
0.05 0.3907000 0.00000000 
0.06 0.3907000 0.00000000 
0.07 0.3907000 0.00000000 
0.08 0.3907000 0.00000000 
0.09 0.3907000 0.00000000 
0.10 0.3907000 0.00000000

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01.

# display the best configuration

print (fit.cart$bestTune) 


cp 
2 0.01 


plot(fit.cart)

Figure 8-5. Accuracy across cross validated samples and complexity parameters


The plot in Figure 8-5 clearly shows the peak of accuracy is at a cp value equal to 0.01,
which corresponds to an accuracy of about 0.40, which is higher than our previous optimized
models. Also observe that our model in this case is Recursive Partitioning and Regression Trees.
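The selection rule reported in the output ("Accuracy was used to select the optimal model using the largest value") can be reproduced by hand from the printed table. The sketch below copies the cp and Accuracy columns from the fit.cart output; the data frame and variable names are assumptions for illustration.

```r
# Resampling results copied from the fit.cart output above
results <- data.frame(
  cp = seq(0, 0.1, by = 0.01),
  Accuracy = c(0.3557312, 0.4014336, 0.3966989, rep(0.3907000, 8))
)

# Keep the row with the largest cross-validated accuracy
best <- results[which.max(results$Accuracy), ]
best
```

The selected row is the second one, which matches the row index 2 and cp = 0.01 printed by fit.cart$bestTune.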


8.4.5 Random Search 


The details of the model for random search optimization are discussed in this section. 
Response Variable: ProductChoice 


Predictors: MembershipPoints, CustomerAge, 
PurchaseTenure, CustomerPropensity, and 
LastPurchaseDuration 


Error Calculation: Cross Validation 


Model Type: Random Forest

# Randomly search algorithm parameters


# Select the best tuning configuration 
dataset <-Data 


Here, we set the trainControl function with method="repeatedcv", meaning use
repeated cross validation, and the parameter search set to random.


# prepare training scheme 

control <-trainControl(method="repeatedcv", number=10, repeats=3, 
search="random" ) 

# train the model 

model <-train(factor(ProductChoice) ~MembershipPoints +CustomerAge 
+PurchaseTenure +CustomerPropensity +LastPurchaseDuration, data=dataset, 
method="rf", trControl=control) 

# summarize the model 

print (model) 


Random Forest 


10000 samples 
5 predictor 
4 classes: '1', '2', '3', '4'


No pre-processing 

Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 9000, 9000, 9002, 9000, 9000, 8999, ... 
Resampling results across tuning parameters: 


mtry Accuracy Kappa 

3 0.4091006 0.07772332 
4 0.3863345 0.08039752 
6 0.3640687 0.06873901 


Accuracy was used to select the optimal model using the largest value. 
The final value used for the model was mtry = 3. 


# plot the effect of parameters on accuracy
plot(model)

Figure 8-6. Accuracy across cross validated sets and randomly selected predictors


Random search is usually faster and more efficient than an exhaustive grid search for
tuning. In this case, the search evaluated only three randomly chosen values of mtry
(3, 4, and 6) and selected mtry = 3. The random forest model is used in this example.
Random forests, for which caret tunes only mtry, are optimized quickly with random search,
which saves a lot of time in tuning random forest
models.
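The idea behind random search can be sketched without caret: draw a few candidate values at random, score each one, and keep the best. In the sketch below the score function is a hypothetical stand-in for cross-validated accuracy, and the candidate range 1:10 is an assumption made for illustration.

```r
set.seed(917)

# Hypothetical stand-in for cross-validated accuracy as a function of mtry
score <- function(mtry) 0.41 - 0.01 * (mtry - 3)^2

# Random search: evaluate only a few randomly drawn candidates
candidates <- sample(1:10, size = 3)
results <- data.frame(mtry = candidates, Accuracy = score(candidates))

# Keep the candidate with the highest score
best <- results[which.max(results$Accuracy), ]
best
```

A grid search over the same range would score all ten values; the random search scores three, at the risk of missing the true optimum.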


8.4.6 Custom Searching 


Custom search algorithms provide advanced ways of guiding the algorithm to optimize 
tuning parameters. Advanced users of machine learning can create their own search 
algorithms to optimize hyper-parameters. In this example, we show one such search 
optimization. 

Response Variable: ProductChoice

Predictors: MembershipPoints, CustomerAge,
PurchaseTenure, CustomerPropensity, and
LastPurchaseDuration

Error Calculation: Cross Validation

Model Type: Custom Random Forest


setwd("C:/Personal/Machine Learning/Chapter 8/");

library(caret)
library(randomForest)
library(class)

# Load Dataset
Purchase_Data <- read.csv("Purchase Prediction Dataset.csv", header=TRUE)

data <- na.omit(Purchase_Data)

# Create a sample of 10K records
set.seed(917);
Data <- data[sample(nrow(data), size=10000), ]

# Select the best tuning configuration
dataset <- Data

# Custom Parameter Search

# load the packages
library(randomForest)
library(mlbench)
library(caret)


In this example, we have come up with a custom model definition for evaluation. It wraps
the randomForest algorithm for a classification problem so that both mtry and ntree can be
tuned. This is an advanced way of creating your own search functions.


# define the custom caret algorithm (wrapper for Random Forest) 

customRF <- list(type="Classification", library="randomForest", loop=NULL)
customRF$parameters <- data.frame(parameter=c("mtry", "ntree"),
  class=rep("numeric", 2), label=c("mtry", "ntree"))
customRF$grid <- function(x, y, len=NULL, search="grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry=param$mtry, ntree=param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc=NULL, submodels=NULL) {
  predict(modelFit, newdata)
}
customRF$prob <- function(modelFit, newdata, preProc=NULL, submodels=NULL) {
  predict(modelFit, newdata, type="prob")
}
customRF$sort <- function(x) { x[order(x[,1]), ] }
customRF$levels <- function(x) { x$classes }

# Load Dataset
dataset <- Data
metric <- "Accuracy"

# train model
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
tunegrid <- expand.grid(.mtry=c(1:4), .ntree=c(100, 150, 200, 250))
set.seed(917)

custom <- train(factor(ProductChoice) ~ MembershipPoints + CustomerAge
+ PurchaseTenure + CustomerPropensity + LastPurchaseDuration, data=dataset,
method=customRF, metric=metric, tuneGrid=tunegrid, trControl=trainControl)
print (custom) 


10000 samples 
5 predictor 
4 classes: '1', '2', '3', '4'

No pre-processing

Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 9000, 8999, 9001, 9001, 9000, 9000, ... 
Resampling results across tuning parameters: 


mtry ntree Accuracy Kappa 

1 100 0.4091336 0.05088226 
1 150 0.4078343 0.04944209 
1 200 0.4082998 0.04973571 
1 250 0.4076663 0.04861050 
2 100 0.4141003 0.07256969 
2 150 0.4145340 0.07306897 
2 200 0.4142334 0.07232983 
2 250 0.4144336 0.07289516 
3 100 0.4090333 0.07980804 
3 150 0.4081328 0.07744357 
3 200 0.4079661 0.07782225 
3 250 0.4086323 0.07818017 
4 100 0.3797990 0.07244785 
4 150 0.3804304 0.07231228 
4 200 0.3826303 0.07566550 
4 250 0.3838646 0.07796204 


Accuracy was used to select the optimal model using the largest value. 


The final values used for the model were mtry = 2 and ntree = 150.

plot(custom)

Figure 8-7. Accuracy across cross validated samples and parameter mtry (legend: ntree = 100, 150, 200, 250)

Custom search optimization gives us the highest accuracy so far, 0.415. For this
problem, this seems to be the best accuracy. Again, to emphasize, we were using the same
data and the same variables throughout and saw how performance kept varying. The next
section discusses a very important concept in model performance: the tradeoff between bias
and variance.


8.5 The Bias and Variance Tradeoff 


The errors in any machine learning algorithm can be attributed to bias, variance, and
an irreducible error. The tradeoff or dilemma of bias and variance is the problem of
minimizing bias and variance simultaneously in any machine learning algorithm. In
general, reducing one tends to increase the other.

In performance measurement, we say bias causes underfitting, while variance
causes overfitting. Figure 8-8 shows a very good graphical representation, provided by
Scott Fortmann-Roe in his blog, using a bull's-eye diagram.


Figure 8-8. Bias and variance illustration using the bull's-eye plot (columns: low variance, high variance; rows: low bias, high bias)

Fortmann-Roe further provides a conceptual definition of errors due to bias and variance.
Looking at the image in Figure 8-8, it becomes easy to visualize how errors due to bias
and variance impact results. The definitions provided by Fortmann-Roe are:


• Error due to bias: The error due to bias is taken as the difference
between the expected (or average) prediction of our model and the
correct value that we are trying to predict.

• Error due to variance: The error due to variance is taken as the
variability of a model prediction for a given data point. Again,
imagine that you can repeat the entire model building process
multiple times. The variance is how much the predictions for a
given point vary between different realizations of the model (Source:
http://scott.fortmann-roe.com/docs/BiasVariance.html).


Breaking down the generalization error of a machine learning algorithm is called
bias-variance decomposition, and it splits the error into three components:

• Square of bias
• Variance
• Irreducible error


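Mathematically, the decomposed equation looks like this:

$$E\left[(y - \hat{f}(x))^2\right] = \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2$$

where

$$\mathrm{Bias}\left[\hat{f}(x)\right] = E\left[\hat{f}(x) - f(x)\right]$$

and

$$\mathrm{Var}\left[\hat{f}(x)\right] = E\left[\hat{f}(x)^2\right] - E\left[\hat{f}(x)\right]^2$$

The derivation of this equation is also easy and can be done for generalized cases, as
follows.

For any random variable X, variance is defined as

$$\mathrm{Var}[X] = E\left[X^2\right] - E[X]^2$$

Equivalently,

$$E\left[X^2\right] = \mathrm{Var}[X] + E[X]^2$$

Assume $f = f(x)$ and $\hat{f} = \hat{f}(x)$. As $f$ is deterministic,

$$E[f] = f$$

Hence, $y = f + \epsilon$ and $E[\epsilon] = 0$ imply

$$E[y] = E[f + \epsilon] = E[f] = f$$

Also, $\mathrm{Var}[\epsilon] = \sigma^2$. Hence,

$$\mathrm{Var}[y] = E\left[(y - E[y])^2\right] = E\left[(y - f)^2\right] = E\left[(f + \epsilon - f)^2\right] = E\left[\epsilon^2\right] = \mathrm{Var}[\epsilon] + E[\epsilon]^2 = \sigma^2$$

Since $\epsilon$ and $\hat{f}$ are independent, we have

$$
\begin{aligned}
E\left[(y - \hat{f})^2\right] &= E\left[y^2 + \hat{f}^2 - 2y\hat{f}\right] \\
&= E\left[y^2\right] + E\left[\hat{f}^2\right] - E\left[2y\hat{f}\right] \\
&= \mathrm{Var}[y] + E[y]^2 + \mathrm{Var}[\hat{f}] + E[\hat{f}]^2 - 2fE[\hat{f}] \\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat{f}] + \left(f - E[\hat{f}]\right)^2 \\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat{f}] + E\left[f - \hat{f}\right]^2 \\
&= \sigma^2 + \mathrm{Var}[\hat{f}] + \mathrm{Bias}[\hat{f}]^2
\end{aligned}
$$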


The irreducible error is the noise term in the true relationship that cannot 
fundamentally be reduced by any model. This derivation in the linear regression setup is 
explained in "Notes on Derivation of Bias-Variance Decomposition in Linear Regression," 
by Shakhnarovich, Greg (2011). A similar decomposition is possible in other machine 
learning algorithms. 

Further, the tradeoff is shown here. The graphical representation of this tradeoff also 
gives us an idea as to how to tweak our machine learning algorithms to reach that sweet 
spot where the variance and bias are minimum given this tradeoff constraint. 

The following code snippet shows this tradeoff on a simple model prototype. In the
following example, we calculate the mean square error, bias, and variance for hypothetical
data, and then plot how varying the value of shrink, a numeric vector, changes these quantities.

mu <- 2
Z <- rnorm(20000, mu)

MSE <- function(estimate, mu) {
  return(sum((estimate - mu)^2) / length(estimate))
}

n <- 100
shrink <- seq(0, 0.5, length=n)
mse <- numeric(n)
bias <- numeric(n)
variance <- numeric(n)

for (i in 1:n) {
  mse[i] <- MSE((1 - shrink[i]) * Z, mu)
  bias[i] <- mu * shrink[i]
  variance[i] <- (1 - shrink[i])^2
}
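As a sanity check on the decomposition, the simulated mean square error of the shrunken estimate should be close to the sum of squared bias and variance at every shrink value. The sketch below repeats the setup above in self-contained form; the fixed seed and the names mse_sim, bias_sq, and gap are assumptions added for illustration.

```r
set.seed(917)  # seed added so the simulated check is repeatable
mu <- 2
Z <- rnorm(20000, mu)
shrink <- seq(0, 0.5, length = 100)

# Simulated MSE of the shrunken estimate (1 - shrink) * Z
mse_sim <- sapply(shrink, function(s) mean(((1 - s) * Z - mu)^2))

# Analytic pieces: |bias| = shrink * mu, variance = (1 - shrink)^2 * Var(Z), with Var(Z) = 1
bias_sq <- (mu * shrink)^2
variance <- (1 - shrink)^2

# Largest gap between the simulation and bias^2 + variance over all shrink values
gap <- max(abs(mse_sim - (bias_sq + variance)))
gap
```

The gap is only sampling noise from the 20,000 draws; there is no irreducible error term here because the estimate is compared directly with mu.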


Now let's plot the bias-variance tradeoff using the plot function; we can use the
ggplot function as well.


# Bias-Variance tradeoff plot 


plot(shrink, mse, xlab='Shrinkage', ylab='MSE', type='l', col='pink', lwd=3,
lty=1, ylim=c(0,1.2))
lines(shrink, bias^2, col='green', lwd=3, lty=2)
lines(shrink, variance, col='red', lwd=3, lty=2)
legend(0.02,0.6, c('Bias^2', 'Variance', 'MSE'), col=c('green', 'red',
'pink'), lwd=rep(3,3), lty=c(2,2,1))

Figure 8-9. Bias versus variance tradeoff plot

You can see in the plot in Figure 8-9 that the variance and bias show opposite
behavior: as shrinkage increases, the variance falls while the squared bias rises. The best
point for a model is where the total error (the MSE curve) is lowest, which balances the
two, and this is the point that we try to use for the final model. Early indications of
model performance suffering from bias or variance can be seen by evaluating the model
on the test data. Test data is not seen by the model during training, and hence we can
measure its true performance or error on test (hold-out) data.


• Model suffering from variance: When the model fits well on the train
data but poorly on the test data. This shows that the variability of
prediction is high and the variance error is dominating.

• Model suffering from bias: When the model fits poorly on both
train and test data. The error due to bias is driving the bad
performance of the model.


Having a good understanding of the bias-variance tradeoff helps you decide which
methods can be applied to correct for bias or variance issues in the model. But before
we jump to the main methods of performance improvement by dealing with bias and
variance, we list a few common steps that might be taken to improve the model
performance:

• Bring more data into the model
• Bring in more features
• Revisit feature selection and create stronger features
• Regularization methods of feature selection can help
• Sampling can also be explored (upsample/downsample/resample)
• Try other learning algorithms

Once you have worked through these steps, you can consider applying the methods that
follow to improve model performance further.


8.5.1 Bagging or Bootstrap Aggregation 
Bagging can be used to train the same model on multiple samples, which reduces variance.
If the modeling is repeated n times, i.e., you create your model on n
samples with each sample independent of the others, you reduce the variance by a factor of
n. In other words, if you perform n replications of each configuration and let

$$Z_j = X_{1j} - X_{2j}, \quad \text{for } j = 1, 2, \ldots, n,$$

then, since the $Z_j$ are independent and identically distributed random variables,

$$\mathrm{Var}\left[\bar{Z}(n)\right] = \frac{\mathrm{Var}(Z_j)}{n}$$

This shows that developing models on multiple samples will reduce the variance. Monte
Carlo methods have a detailed theory around this behavior of large sample statistics.
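The 1/n behavior of the variance of an average can be checked with a short simulation. This is a minimal sketch under the assumption of independent standard normal draws; the values of n and reps are arbitrary choices for illustration.

```r
set.seed(917)
n <- 25       # number of independent samples being averaged
reps <- 5000  # number of times the whole experiment is repeated

# Each replicate averages n independent N(0, 1) draws
means <- replicate(reps, mean(rnorm(n)))

# Variance of a single draw is 1, so the average should have variance close to 1 / n
c(variance_of_mean = var(means), theory = 1 / n)
```

The reduction relies on independence. Bootstrap samples drawn from the same training data are correlated, so bagged models in practice see a smaller reduction than 1/n.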
8.5.2 Boosting


Boosting fits models successively on the errors of the previous models, which reduces
bias. It repeatedly develops models on the residuals to get better accuracy. For example,
the first model is developed and it gives 70% accuracy; then the 30% inaccurately predicted
cases are used to develop another model to bring additional accuracy. This process is
repeated until there is no improvement in accuracy. After infinite iterations, you are left
with an irreducible error that contains no additional information.
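The loop of fitting each new model to the residuals of the current one can be sketched in base R. The example below is a regression analogue (not the classification case described above) that uses one-split stumps as the weak model; the simulated data, the learning rate, and the helper names fit_stump and predict_stump are all assumptions made for illustration.

```r
set.seed(917)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# Fit a one-split regression stump to the residuals r by least squares
fit_stump <- function(x, r) {
  candidates <- head(sort(unique(x)), -1)  # drop the max so both sides are non-empty
  sse <- sapply(candidates, function(s) {
    left <- x <= s
    sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
  })
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

pred <- rep(mean(y), length(y))  # start from a constant model
rate <- 0.5                      # hypothetical learning rate
train_mse <- numeric(20)
for (m in 1:20) {
  residual <- y - pred                         # what the current model gets wrong
  stump <- fit_stump(x, residual)              # model the errors
  pred <- pred + rate * predict_stump(stump, x)
  train_mse[m] <- mean((y - pred)^2)
}
round(train_mse[c(1, 5, 10, 20)], 4)
```

The training error cannot increase from one round to the next, which is the bias reduction at work; error on held-out data is what tells you when to stop.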

We will discuss these methods in more detail after introducing the idea of ensemble 
learning. Ensemble learning is a method of using multiple models to solve a modeling 
problem. Ensemble learning is very effective in reducing the bias and variance of models. 
Another important aspect to keep in mind before we do a deep dive is the complexity 
of the model and production environment. As the model becomes more complex, it 
becomes difficult to interpret and implement in actual business applications. A data 
scientist has to be very careful in choosing the methods to reduce errors, as there is a cost- 
benefit analysis of the degree of improvement. 

In general, we might not get into the decomposition of error, but mostly focus on the
total error only. Some data scientists believe that the incremental benefits of the
decomposition are not that great compared to its computational and complexity cost.
Instead, we should focus on using an accurate measure of prediction error, explore
different levels of model complexity, and then choose the complexity level that minimizes
the overall error.
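Choosing the complexity level that minimizes the overall error can be sketched with a hold-out set. In the sketch below, complexity is the degree of a polynomial regression; the simulated data, the train/test split, and the range of degrees are assumptions made for illustration.

```r
set.seed(917)
n <- 300
x <- runif(n, -1, 1)
y <- sin(3 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)
train <- dat[1:200, ]
test <- dat[201:300, ]

# Fit one model per complexity level and measure its error on hold-out data
degrees <- 1:8
test_mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

# Pick the complexity level with the lowest hold-out error
best_degree <- degrees[which.min(test_mse)]
data.frame(degree = degrees, test_mse = round(test_mse, 4))
```

No decomposition into bias and variance is needed here; only the total hold-out error is compared across complexity levels.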


8.6 Introduction to Ensemble Learning 


The general idea of ensemble learning is better decision making with collective 
intelligence. The ensemble techniques are certainly a game changer in machine learning. 
In statistics and machine learning, ensemble learning means learning from multiple 
algorithms to improve the model performance. 

Generally, the supervised algorithms perform the task of searching for a solution 
in hypothesis/parameter space and finding a suitable hypothesis/parameter that fits 
the problem at hand. As with any search problem, we can’t always find the best solution 
in limited iterations. In such situations, ensembles can be used to combine multiple 
hypotheses to form a (generally) better hypothesis. 

As more than one model is involved in an ensemble, ensembles are
computationally heavy as well as difficult to evaluate on a single parameter. In general,
fast algorithms are recommended for use in ensemble methods, e.g., decision tree
ensembles (randomForest); however, slower algorithms benefit from ensemble methods
equally. Similarly, you can apply ensemble learning to unsupervised learning algorithms.
An ensemble learns from underlying models, hence it is itself a supervised learning
algorithm.

We will use an example to understand the benefits of ensemble learning by "voting
ensembles".

8.6.1 Voting Ensembles


Voting ensembles are the most popular ensemble method in classification problems. 
This ensemble combines the final class results from multiple models and chooses the 
one with the majority vote. It need not be only majority votes; you can weight them
based on multiple other factors, e.g., individual model performance, complexity, etc.
Figure 8-10 is an illustration of an ensemble based on majority votes.


Figure 8-10. Voting ensemble learning for a classification problem (different machine learning methods are combined into a final output)

(Source: "Ensemble learning prediction of protein-protein interactions using
proteins functional annotations" by Saha, Zubek, et al.)

Now, to help internalize the idea of voting ensembles, let's work through a
hypothetical example, as illustrated here.


Problem: Finding defective bulbs (=1) in a manufactured lot 
of bulbs 


Ensemble models: We have three inspection experts (read 
models) A, B, and C, to identify defective pieces. You can use 
any one of them or all of them. 


Additional information: Accuracy of A is 0.7, accuracy of B is
0.68, and accuracy of C is 0.65. Their decision is independent of
any other decision.

We have three binary classifier models (A, B, and C) with 0.7, 0.68, and 0.65 accuracy,
respectively. We will now show what happens if all of these models are used together in
an ensemble model with the majority vote.

For a majority vote with three models, we can expect four outcomes:

• All three are correct
   a. 0.7 * 0.68 * 0.65 = 0.3094

• Two are correct
   a. 0.7 * 0.68 * 0.35 = 0.1666
   b. 0.7 * 0.32 * 0.65 = 0.1456
   c. 0.3 * 0.68 * 0.65 = 0.1326
   Total = 0.4448

• Two are wrong
   a. 0.3 * 0.32 * 0.65 = 0.0624
   b. 0.3 * 0.68 * 0.35 = 0.0714
   c. 0.7 * 0.32 * 0.35 = 0.0784
   Total = 0.2122

• All three are wrong
   a. 0.3 * 0.32 * 0.35 = 0.0336


In scenario 2 (two models correct), we can see that on average, the majority vote
ensemble corrects for ~44% of the cases. This ensemble of three models will give us an
average accuracy of ~75.4% (0.4448 + 0.3094), which is more than any individual model.
However, the important condition for this kind of increase is the assumption that the
models and their predictions are independent of each other. This
independence condition generally doesn't hold, and hence sometimes you might struggle
to see improvements in model performance, even with an ensemble of many models.
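The arithmetic of this example can be reproduced by enumerating all eight correct/wrong combinations of the three models. The sketch below uses the accuracies that appear in the calculations (0.7, 0.68, and 0.65) and assumes independence, as the text does; the variable names are illustrative.

```r
acc <- c(A = 0.7, B = 0.68, C = 0.65)

# Enumerate every correct (1) / wrong (0) combination for the three models
outcomes <- expand.grid(A = 0:1, B = 0:1, C = 0:1)

# Probability of each combination under independence
outcomes$prob <- apply(outcomes[, c("A", "B", "C")], 1, function(v) {
  prod(ifelse(v == 1, acc, 1 - acc))
})

# The majority vote is correct when at least two of the three models are correct
outcomes$n_correct <- rowSums(outcomes[, c("A", "B", "C")])
ensemble_accuracy <- sum(outcomes$prob[outcomes$n_correct >= 2])
ensemble_accuracy
```

The result is 0.7542, the sum of the "all three correct" and "two correct" outcomes. Changing acc shows how quickly the gain shrinks when the individual models are closer to 0.5.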


8.6.2 Advanced Methods in Ensemble Learning 


Broadly, there are two types of ensemble helping in variance and bias reduction. There 
are some variants around the same idea like blending, stacking, and custom ensembles, 
but the core idea can be explained by the two methods of bagging and boosting. 


8.6.2.1 Bagging 


Bootstrap aggregation, also called bagging, is an ensemble meta-algorithm. This algorithm
improves the stability and accuracy of the model and reduces overfitting. It can be used
with any modeling method; for continuous outputs, it takes a (weighted) average of the
model outputs, and in classification, it combines the model outputs by vote into one
single output.

Bagging was proposed by Leo Breiman in 1994 for improving results of a
classification problem. Details of his original work can be found in his technical paper,
"Bagging Predictors," Technical Report No. 421, 1994, Department of Statistics.

Figure 8-11 shows a bagging ensemble flow. The steps in bagging are broadly divided 
into four parts: 


1. Create samples from the training data; the number of samples
should be appropriate (not too many or too few).

2. Train the model on the individual samples.

3. Create classifiers from each model and store the results.

4. Combine the results to predict the test data, using a weighted vote, a
majority vote, or some custom scheme, depending on the type of ensemble.
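The four steps can be sketched by hand before turning to packaged implementations. This is a minimal illustration, not the method used later in the chapter: the built-in iris data, the rpart tree as base model, the 100/50 split, and the choice of 25 bootstrap samples are all assumptions.

```r
library(rpart)

set.seed(917)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]
n_models <- 25

# Steps 1-3: draw a bootstrap sample, train a tree on it, store its test predictions
votes <- sapply(seq_len(n_models), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Step 4: majority vote across the models for each test row
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(bagged == as.character(test$Species))
accuracy
```

Each column of votes holds one model's predictions, so the models can be trained independently and in parallel; only the final vote needs all of them.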


The image in Figure 8-11 illustrates the four steps in bagging mentioned earlier
(Source: http://cse-wiki.unl.edu/).

Figure 8-11. Bagging ensemble flow


Consider these important features of bagging: 


• Each model is developed in parallel and independent of the others

• Helps decrease the variance but is ineffective in reducing bias

• Best suited for high variance, low bias models (complex models)

• Random forest is a good example (the randomForest algorithm considers a random
subset of predictors at each split to reduce correlation between the trees)

8.6.2.2 Boosting


Similar to bagging, boosting is also an ensemble meta-algorithm, meant to reduce bias in
supervised learning models. Historically, boosting tries to answer this question: suppose
we have a classifier whose accuracy is always below 50% (a weak classifier). Can we
build a sequence of models to reach zero error (minimal error)? Theoretically,
this is possible by successively passing the residual to successive models. In general,
the successive models create so many convoluted relationships in the final model that it
becomes difficult to explain; hence, boosting is sometimes known to create a
black box, something very hard to explain and understand.

For instance, if you design three-pass boosting, and suppose the classifier is always
40% correct, then for a set of 100 objects we will have 60 misclassified in the first pass.
The second pass receives only the misclassified objects, so 36 will be misclassified (60%
of 60). In the third pass, you again pass only the misclassified objects and get about 22
misclassified (60% of 36). So essentially, by using a classifier with only 40% accuracy,
you can create an ensemble with an error of only 22% (22/100), or a model with 78% accuracy.
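The three-pass arithmetic is a geometric decay of the misclassified count, which a few lines of R make explicit. The variable names are illustrative, and the idealized assumption from the text (the classifier is exactly 40% correct on whatever it is given) is kept.

```r
n_objects <- 100
error_rate <- 0.6  # the classifier is 40% correct on whatever it is given
passes <- 1:3

# After each pass, only 60% of the previously misclassified objects remain wrong
misclassified <- n_objects * error_rate^passes
data.frame(pass = passes,
           misclassified = misclassified,
           accuracy = 1 - misclassified / n_objects)
```

The third row gives 21.6 misclassified objects, which the text rounds to 22, for an overall accuracy of about 78%.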

In reality, the theoretical underpinnings remain the same, but the
improvements are not that dramatic, as many other factors come into play, e.g., the
model becoming weaker with each pass, reweighting, correlation, etc.

Figure 8-12 shows a boosting ensemble flow. The steps in boosting are described here: 


1. First fit a model on the full training dataset; in Figure 8-12, you
get 42% accuracy in the first model.

2. Fit another classifier and get 65% accuracy.

3. Fit the third model to get 92% accuracy.

4. Now you combine these different classifiers to form a strong
classifier.


$$H_{\text{final}} = \mathrm{sign}\left(0.42\,h_1 + 0.65\,h_2 + 0.92\,h_3\right)$$

where $h_1$, $h_2$, and $h_3$ stand for the three classifiers fitted in steps 1 to 3.

Figure 8-12. Boosting ensemble flow

(Source: https://alliance.seas.upenn.edu). You can see here that the boosted
machine, i.e., the combined classifier, performs far better than the individual classifiers.
A few important features of boosting are listed here:


• Each model is developed sequentially, so each successive model
is built on the areas where the previous model was lacking.

• Helps decrease the bias, but is ineffective in reducing variance.

• Best suited for low variance, high bias models.

• Gradient boosting machine is a powerful algorithm using a
boosting ensemble.
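The final combination step for Figure 8-12, a sign taken over a weighted sum of the individual classifiers, can be sketched as a weighted vote. The weights 0.42, 0.65, and 0.92 are the values given for Figure 8-12; the -1/+1 coding and the four example observations are hypothetical.

```r
weights <- c(0.42, 0.65, 0.92)  # values given for Figure 8-12

# Hypothetical outputs of the three classifiers for four observations,
# coded as -1 / +1 (one row per observation, one column per classifier)
votes <- matrix(c( 1,  1,  1,
                   1, -1, -1,
                  -1, -1,  1,
                   1,  1, -1),
                ncol = 3, byrow = TRUE)

# Weighted sum of the votes, then sign() turns it back into a class label
h_final <- sign(as.vector(votes %*% weights))
h_final
```

In the third row the most heavily weighted classifier is outvoted by the other two, because 0.42 + 0.65 exceeds 0.92; with weights this close, no single classifier dominates.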


In the following sections, we will show one example each of bagging, boosting,
blending, and stacking on our purchase prediction data. The output tables are easy to
read and the plots will make the process of model improvement clear.

Note that the parameters are not tuned for the examples. 


8.7 Ensemble Techniques Illustration in R 


Ensemble training is broadly of two types: bagging and boosting. However, researchers
have proposed many other variants. In this section, we show some examples
in R using our purchase prediction data.

This section shows chunks of R code, which can be reused with any dataset
you want. The specific function calls and their options are described in the
documentation of the caret package and its dependencies.

For all of the following examples, there are three important functions to calibrate for 
each of the techniques: 


• trainControl(): Sets the sampling method, summary, and other
training parameters.

• train(): Trains the models with the trainControl() parameters;
the modeling method is also defined in this function.

• Ensemble method: Combines the results from different models
using custom functions, resample, or caretEnsemble functions.


Let’s now start building ensemble models using the R environment. 


8.7.1 Bagging Trees 

The two most popular bagging algorithms are used here:

• Bagged CART (regression tree)
• Random forest


The following code creates two models based on these techniques and shows the
comparison between these two tree methods.

library(caret)
library(randomForest)
library(class)
library(ipred)

# Load Dataset
Purchase_Data <- read.csv("Purchase Prediction Dataset.csv", header=TRUE)

data <- na.omit(Purchase_Data)

# Create a sample of 10K records
set.seed(917);
Data <- data[sample(nrow(data), size=10000), ]

# Select the best tuning configuration
dataset <- Data

# Example of Bagging algorithms
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"


The following code snippet fits a bagged tree model. 


# Bagged CART 

set.seed(917) 

fit.treebag <-train(factor(ProductChoice) ~MembershipPoints +CustomerAge 
+PurchaseTenure +CustomerPropensity +LastPurchaseDuration, data=dataset, 
method="treebag", metric=metric, trControl=control) 


Loading required package: plyr 
Loading required package: e1071 


The following code snippet fits a Random Forest model. 


# Random Forest 

set.seed(917) 

fit.rf <-train(factor(ProductChoice) ~MembershipPoints +CustomerAge 
+PurchaseTenure +CustomerPropensity +LastPurchaseDuration, data=dataset, 
method="rf", metric=metric, trControl=control) 


This summarizes the bagged results from the two methods using the resamples()
function in the caret package.

# summarize results
bagging_results <- resamples(list(treebag=fit.treebag, rf=fit.rf))
summary(bagging_results)

Call:
summary.resamples(object = bagging_results)

Models: treebag, rf
Number of resamples: 30

Accuracy
         Min.    1st Qu.  Median   Mean     3rd Qu.  Max.    NA's
treebag  0.327   0.3444   0.3505   0.3518   0.3583   0.384   0
rf       0.395   0.4095   0.4167   0.4151   0.4216   0.435   0

Kappa
         Min.     1st Qu.  Median   Mean     3rd Qu.  Max.    NA's
treebag  0.02242  0.04812  0.05498  0.05786  0.06759  0.1044  0
rf       0.04252  0.06536  0.07513  0.07387  0.08290  0.1032  0

dotplot(bagging_results)

The accuracy for the random forest is better than that of the bagged CART. The plot in
Figure 8-13 shows the comparison of both algorithms on Kappa and accuracy.

Figure 8-13. Accuracy and Kappa of bagged tree (panels: Accuracy, Kappa; confidence level: 0.95)


8.7.2 Gradient Boosting with a Decision Tree 
For boosting, we will look at the two most popular algorithms:

• C5.0: Decision tree developed by Ross Quinlan
• Gradient Boosting Machine

The following code first creates a C5.0 decision tree model and then a GBM model. Once
we have both models ready, we create a boosting ensemble with these two models combined.


library (C50) 
library (gbm) 


dataset <-Data; 

# Example of Boosting Algorithms 

control <-trainControl(method="repeatedcv", number=10, repeats=3) 
metric <- "Accuracy"


Here, we are fitting a C5.0 decision tree model. 


# C5.0 

set.seed(917) 

fit.c50 <-train(factor(ProductChoice) ~MembershipPoints +CustomerAge 
+PurchaseTenure +CustomerPropensity +LastPurchaseDuration, data=dataset, 
method="C5.0", metric=metric, trControl=control) 

fit.c50 


C5.0 


10000 samples 
5 predictor 
4 classes: '1', '2', '3', '4'


No pre-processing 

Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 9000, 8999, 9001, 9001, 9000, 9000, ... 
Resampling results across tuning parameters: 


model winnow trials Accuracy Kappa 


rules FALSE 1 0.3924345 0.07807159 
rules FALSE 10 0.3924345 0.07807159 
rules FALSE 20 0.3924345 0.07807159 
rules TRUE 1 0.4003660 0.03854515 
rules TRUE 10 0.4003660 0.03854515 
rules TRUE 20 0.4003660 0.03854515 
tree FALSE 1 0.3786998 0.06855999 
tree FALSE 10 0.3786998 0.06855999 
tree FALSE 20 0.3786998 0.06855999 
tree TRUE 1 0.3999658 0.03799627 
tree TRUE 10 0.3999658 0.03799627 
tree TRUE 20 0.3999658 0.03799627 


Accuracy was used to select the optimal model using the largest value. 
The final values used for the model were trials = 1, model = rules 


and winnow = TRUE. 


plot(fit.c50)


Figure 8-14. Accuracy across boosting iterations: C5.0 


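The train() function selects the final model using the largest value of resampled accuracy.
Here the accuracy does not change with the number of trials, so the selected model uses
trials = 1, that is, a single C5.0 model without further boosting iterations.

Next, we create a Gradient Boosting Machine (GBM) with the same dataset.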


# Stochastic Gradient Boosting 

set.seed(917) 

fit.gbm <- train(factor(ProductChoice) ~ MembershipPoints + CustomerAge +
                   PurchaseTenure + CustomerPropensity + LastPurchaseDuration,
                 data=dataset, method="gbm", metric=metric, trControl=control,
                 verbose=FALSE)

fit.gbm


Stochastic Gradient Boosting

10000 samples
    5 predictor
    4 classes: '1', '2', '3', '4'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 9000, 8999, 9001, 9001, 9000, 9000, ...
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa
  1                   50      0.4133000  0.07395657
  1                  100      0.4112656  0.07721806
  1                  150      0.4104981  0.07825744
  2                   50      0.4157985  0.08170535
  2                  100      0.4138310  0.08341336
  2                  150      0.4136634  0.08690728
  3                   50      0.4133309  0.08146098
  3                  100      0.4117326  0.08628274
  3                  150      0.4108320  0.08948114

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 50, interaction.depth
= 2, shrinkage = 0.1 and n.minobsinnode = 10.


plot(fit.gbm) 


[Plot: Accuracy (Repeated Cross-Validation) against # Boosting Iterations, one line per Max Tree Depth]


Figure 8-15. Accuracy across boosting iterations: GBM 
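To make the roles of the n.trees and shrinkage tuning parameters reported in the GBM output more concrete, here is a minimal sketch of gradient boosting for squared-error loss with one-split trees ("stumps"). It is written in base R on simulated data and is not the gbm package's implementation; the helper names fit_stump and predict_stump are made up for illustration.

```r
set.seed(917)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)   # simulated data, not the book's dataset

# Fit a stump: choose the split on x that minimizes the residual sum of squares.
fit_stump <- function(x, r) {
  candidates <- head(sort(unique(x)), -1)   # drop the largest value so both sides are non-empty
  best <- NULL
  for (s in candidates) {
    left  <- mean(r[x <= s])
    right <- mean(r[x > s])
    sse   <- sum((r - ifelse(x <= s, left, right))^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(split = s, left = left, right = right, sse = sse)
    }
  }
  best
}
predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

n_trees   <- 50    # plays the role of n.trees
shrinkage <- 0.1   # plays the role of shrinkage

pred <- rep(mean(y), length(y))   # start from a constant model
mse  <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  residual <- y - pred                                   # negative gradient of squared error
  stump    <- fit_stump(x, residual)
  pred     <- pred + shrinkage * predict_stump(stump, x) # take a small step
  mse[m]   <- mean((y - pred)^2)
}
print(round(mse[c(1, 10, 50)], 4))
```

Each new tree is fit to what the current ensemble still gets wrong, and the shrinkage keeps each step small. The training error can only go down with more trees, which is why the number of trees has to be chosen by resampling, as train() does above.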


Now we summarize the results by collecting the resampling results of the GBM and C5.0
models using the resamples() function in the caret package.


# summarize results 


boosting_results <- resamples(list(c5.0=fit.c50, gbm=fit.gbm))
summary(boosting_results)


Call:
summary.resamples(object = boosting_results)

Models: c5.0, gbm
Number of resamples: 30

Accuracy
      Min.  1st Qu.  Median    Mean  3rd Qu.    Max.  NA's
c5.0  0.376  0.3917  0.4008  0.4004   0.4088  0.4226     0
gbm   0.398  0.4112  0.4153  0.4158   0.4209  0.4286     0

Kappa
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.  NA's
c5.0  0.00000  0.02886  0.04001  0.03855  0.05701  0.07496     0
gbm   0.05248  0.07366  0.08241  0.08171  0.08875  0.10530     0


dotplot(boosting_results)


Figure 8-16. Accuracy across the boosting ensemble 


We can see that the C5.0 algorithm produces an accuracy of about 40.0% for the best
model, while GBM gives a model with about 41.6% accuracy. Gradient boosting seems to
fit this data slightly better than C5.0.


8.7.3 Blending KNN and Rpart


Blending is an ensemble in which the outputs of different models are combined using
weights, so that not all model outputs are treated equally. The following example uses two
techniques to blend:

• knn
• rpart

In this example, we will blend the knn and rpart methods as a linear combination
of models. The models will be ensembled using the caretEnsemble() function.
caretEnsemble is a package for making ensembles of caret models. The details
of this package can be accessed at https://cran.r-project.org/web/packages/caretEnsemble/vignettes/caretEnsemble-intro.html.


# Blending (linear combination of models)

# load libraries
library(caret)
library(caretEnsemble)
library(MASS)

set.seed(917)
Data <- data[sample(nrow(data), size=10000), ]

dataset <- Data

# collapse the four product choices into two classes
dataset$choice <- ifelse(dataset$ProductChoice == 1 | dataset$ProductChoice == 2,
                         "A", "B")
dataset$choice <- as.factor(dataset$choice)

# define training control
train_control <- trainControl(method="cv", number=4, savePredictions=TRUE,
                              classProbs=TRUE)

# train a list of models
methodList <- c('knn', 'rpart')

models <- caretList(choice ~ MembershipPoints + CustomerAge + PurchaseTenure +
                      CustomerPropensity + LastPurchaseDuration, data=dataset,
                    trControl=train_control, methodList=methodList)

# create ensemble of trained models
ensemble <- caretEnsemble(models)

# summarize ensemble
summary(ensemble)


The following models were ensembled: knn, rpart
They were weighted:
-1.9876 0.4849 3.4433
The resulting Accuracy is: 0.6416
The fit for each individual model on the Accuracy is:
 method   Accuracy   AccuracySD
    knn  0.5924004  0.007451753
  rpart  0.6397990  0.005863011

This output shows that knn and rpart individually reach about 59% and 64%
accuracy, while the blended model reaches 64.2%. This shows that blending allows us
to marginally improve the classification results. In general, the improvements can be even
in the order of 10%. 
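The three weights printed in the summary are consistent with an intercept plus one coefficient per model. Assuming the blender is a logistic regression on the base-model outputs (an assumption about caretEnsemble's internals, not something shown in the output), the idea can be sketched in base R as follows. The probabilities p_knn and p_rpart are simulated stand-ins, not predictions from the models above.

```r
set.seed(917)
n <- 500
y <- rbinom(n, 1, 0.5)   # simulated binary outcome (1 stands for class "B")

# Two hypothetical base models, each producing a noisy probability of class "B".
p_knn   <- plogis(1.0 * (2 * y - 1) + rnorm(n, sd = 2.0))
p_rpart <- plogis(1.0 * (2 * y - 1) + rnorm(n, sd = 1.5))

# The blender is a logistic regression on the base-model outputs;
# its coefficients are the (unequal) weights given to each model.
blender <- glm(y ~ p_knn + p_rpart, family = binomial)
weights <- coef(blender)   # intercept, weight for knn, weight for rpart

acc <- function(p) mean((p > 0.5) == (y == 1))
accuracies <- c(knn = acc(p_knn), rpart = acc(p_rpart),
                blend = acc(fitted(blender)))
print(weights)
print(accuracies)
```

Note that this sketch fits and evaluates the blender on the same rows, which flatters the blend. In practice the weights should be learned from predictions on data the base models did not see.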

The next method, stacking, is very similar to blending. The only difference is
that in stacking we stack models one after another and then weigh the output from each
model to create an ensemble.


8.7.4 Stacking Using caretEnsemble 


Stacking is similar to blending; the only difference is the way the data is extracted for
successive models. The general principle is to not reuse the training data itself when fitting
the next layer. Therefore, we apply rules such as cross-validation (the out-of-fold predictions
are used to train the next layer), which is stacking, and/or holdout validation (part of the
training data is used in the first layer, part in the second), which is blending.

For example, let's take the previous example of the knn and rpart models fit for the
ensemble. Assume that the training set has 100 cases to classify. Then in blending:


1. knn built on 100 cases. 
2. rpart built on 100 cases. 


3. Ensemble model = c1*Knn + c2*Rpart, where c1 and c2 are 
some weights given to each model before combining. This was 
how we blended these two methods. 


The example for stacking will look something like this:

1. knn is built on 100 cases and classifies 60 of them correctly.

2. Build rpart on the 40 misclassified cases from the previous model,
which allows you to classify 20 more correctly. (This is an ideal
situation. In reality, the 100 training cases will be weighted in
a way that the misclassified cases get more weight in training
than the cases correctly classified by the previous model of the
stack.)

3. Now combine the results of the two model runs in an ensemble. In
other words, you stack the results from one model onto the other.


This example is a simplistic view of how the processes of blending and stacking differ
in principle. In general, both methods produce multiple models, which we weigh to
combine them into a single ensemble model. 
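The cross-validation rule mentioned at the start of this section (the second layer is trained only on out-of-fold predictions) can be sketched in base R as follows. This is an illustration on simulated data under our own assumptions: the two base learners are simple logistic regressions chosen for brevity, not the knn and rpart models used earlier.

```r
set.seed(917)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 - x2))   # simulated binary outcome
d  <- data.frame(y, x1, x2)

k     <- 5
folds <- sample(rep(seq_len(k), length.out = n))
oof   <- data.frame(m1 = numeric(n), m2 = numeric(n))   # out-of-fold predictions

for (f in seq_len(k)) {
  train_rows <- d[folds != f, ]
  held_out   <- d[folds == f, ]
  # Layer 1: two deliberately different base learners (hypothetical choices).
  m1 <- glm(y ~ x1, data = train_rows, family = binomial)
  m2 <- glm(y ~ x2, data = train_rows, family = binomial)
  oof$m1[folds == f] <- predict(m1, newdata = held_out, type = "response")
  oof$m2[folds == f] <- predict(m2, newdata = held_out, type = "response")
}

# Layer 2: the meta-model only sees predictions made on rows the base models did not train on.
oof$y <- d$y
meta  <- glm(y ~ m1 + m2, data = oof, family = binomial)
stack_acc <- mean((fitted(meta) > 0.5) == (oof$y == 1))
print(coef(meta))
print(stack_acc)
```

Because every row of oof was predicted by models that never saw that row, the meta-model's weights are not inflated by base models that simply memorized their training data.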
We can combine (or stack) the predictions of multiple caret models using the
caretEnsemble package. In this example, we will stack five different algorithms on our
purchase prediction data:

• Linear Discriminant Analysis (LDA)
• Classification and Regression Trees (CART)
• Logistic regression (via Generalized Linear Model or GLM)
• k-Nearest Neighbors (KNN)
• Support Vector Machine with a Radial Basis Kernel Function (SVM)


# Example of Stacking algorithms
library(kernlab)

# create submodels
control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        savePredictions=TRUE, classProbs=TRUE)


Here is the list of algorithms used for stacking. The five algorithms are stored in
the algorithmList variable, which is passed as a parameter to the training function.


algorithmList <- c('lda', 'rpart', 'glm', 'knn', 'svmRadial')

set.seed(917)

models <- caretList(choice ~ MembershipPoints + CustomerAge + PurchaseTenure +
                      CustomerPropensity + LastPurchaseDuration, data=dataset,
                    trControl=control, methodList=algorithmList)

results <- resamples(models)

summary(results)


Call:
summary.resamples(object = results)

Models: lda, rpart, glm, knn, svmRadial
Number of resamples: 30

Accuracy
             Min.  1st Qu.  Median    Mean  3rd Qu.    Max.  NA's
lda        0.6240   0.6330  0.6443  0.6424   0.6510  0.6600     0
rpart      0.6260   0.6315  0.6383  0.6403   0.6470  0.6640     0
glm        0.6270   0.6336  0.6447  0.6432   0.6518  0.6580     0
knn        0.5710   0.5825  0.5940  0.5908   0.5990  0.6070     0
svmRadial  0.6226   0.6381  0.6470  0.6462   0.6558  0.6683     0

Kappa
              Min.  1st Qu.   Median     Mean  3rd Qu.    Max.  NA's
lda        0.14170  0.16630  0.18680  0.18430   0.2038  0.2357     0
rpart      0.12510  0.15430  0.16750  0.17140   0.1859  0.2319     0
glm        0.14290  0.16440  0.18650  0.18430   0.2010  0.2282     0
knn        0.03609  0.06526  0.09152  0.08519   0.1047  0.1255     0
svmRadial  0.13230  0.16290  0.18680  0.18370   0.2019  0.2319     0


dotplot(results)


Figure 8-17. Accuracy and Kappa of individual models 


We can see from the dot plot in Figure 8-17 that most of the individual models reach
an accuracy of about 64%, while knn is lower at about 59%. Also note that the model training was very
resource intensive, and this level of model complexity may not be suitable for a production environment.
Now let's look at the correlation between the results of the models being stacked.
The correlation shows how similarly the models perform across the resamples.
If the models are highly correlated, we might not see any improvement in
results due to stacking.


# correlation between results 


modelCor(results)

                   lda        rpart          glm            knn      svmRadial
lda         1.00000000   0.65576463  0.974747749  -0.0145770069   0.7366291336
rpart       0.65576463   1.00000000  0.675976986  -0.0350947255   0.6936118174
glm         0.97474775   0.67597699  1.000000000   0.0039610564   0.7336378830
knn        -0.01457701  -0.03509473  0.003961056   1.0000000000  -0.0008377878
svmRadial   0.73662913   0.69361182  0.733637883  -0.0008377878   1.0000000000


splom(results)


Figure 8-18. Scatter plot to list correlations among results from stacked models 


Model correlations are high for a few of the models, for instance lda and
glm (0.97) and lda and svmRadial (0.74). This limits the ensemble power, as discussed in the
previous sections. 
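A correlation matrix like the one above is computed from the per-resample metric values of each model. The following base R sketch reproduces the idea on made-up accuracies (one row per resample, one column per model) and flags the pairs above a threshold; the 0.75 cutoff is an arbitrary choice for illustration.

```r
set.seed(917)
shared <- rnorm(30, sd = 0.01)   # fold difficulty shared by two similar models

# Made-up resampled accuracies, not the values from the book's models.
acc <- data.frame(
  lda = 0.64 + shared + rnorm(30, sd = 0.002),
  glm = 0.64 + shared + rnorm(30, sd = 0.002),
  knn = 0.59 + rnorm(30, sd = 0.01)
)

cor_matrix <- cor(acc)

# List each highly correlated pair once (upper triangle only).
high <- which(abs(cor_matrix) > 0.75 & upper.tri(cor_matrix), arr.ind = TRUE)
pairs_high <- data.frame(model_1 = rownames(cor_matrix)[high[, "row"]],
                         model_2 = colnames(cor_matrix)[high[, "col"]])
print(round(cor_matrix, 3))
print(pairs_high)
```

Models flagged this way behave alike across resamples, so keeping both in a stack adds training cost without adding much new information.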

In the previous example, the models were combined as a linear combination. With the
caretStack() function, we can choose which model is used to combine the predictions of
the base models. Here we show the same example with two different choices. In the first case
we combine the base models with a glm model, and in the second with a random forest; we then
compare whether stacking improved the results.

Stacking using GLM:


# stack using glm 

stackControl <- trainControl(method="repeatedcv", number=10, repeats=3,
                             savePredictions=TRUE, classProbs=TRUE)

set.seed(917)

stack.glm <- caretStack(models, method="glm", metric="Accuracy",
                        trControl=stackControl)

print(stack.glm)


A glm ensemble of 2 base models: lda, rpart, glm, knn, svmRadial

Ensemble results:
Generalized Linear Model

30000 samples
    5 predictor
    2 classes: 'A', 'B'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 27000, 26999, 27000, 27000, 27000, 27001, ...
Resampling results:

  Accuracy   Kappa
  0.6441887  0.1845648


Using glm to stack has given an accuracy of 64%. Next, we do the same
stacking with randomForest.


# stack using random forest 

set.seed(917) 

stack.rf <-caretStack(models, method="rf", metric="Accuracy", 
trControl=stackControl) 

print(stack.rf) 


A rf ensemble of 2 base models: lda, rpart, glm, knn, svmRadial 


Ensemble results: 
Random Forest 


30000 samples 
5 predictor 
2 classes: 'A', 'B'


No pre-processing 

Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 27000, 26999, 27000, 27000, 27000, 27001, ... 
Resampling results across tuning parameters: 


mtry Accuracy Kappa 

2 0.6372440 0.1944063 
3 0.6356217 0.1927612 
5 0.6335549 0.1885745 


Accuracy was used to select the optimal model using the largest value. 
The final value used for the model was mtry = 2.


Using randomForest, we get an accuracy close to 63.7%, which is close to the glm
accuracy (64.4%) but a little lower. Hence for this experiment, stacking using glm works best.
Again, we emphasize that the correlation among some methods is high, so adding
them to the stack does not benefit the model's accuracy.


8.8 Advanced Topic: Bayesian Optimization of 
Machine Learning Models 


In machine learning, hyper-parameter tuning plays an important role. Data scientists
are now paying attention to tuning the parameters before putting the final model
in production. Hence it is important to touch briefly on one of the most important
optimization techniques, called Bayesian optimization. Yachen Yan recently released a
package for Bayesian optimization in R, rBayesianOptimization. We will show you how to use this
package on the house price data.

Bayesian optimization is a way to find the global optimum of a black-box function
(here, the model evaluation metric as a function of the hyper-parameters) without requiring
derivatives. The work done by Jonas Mockus was well received in the academic
community; a comprehensive introduction to this topic can be found in "Bayesian
Approach to Global Optimization: Theory and Applications," Jonas Mockus, Kluwer
Academic (2013).

For this example, we will first get an initial set of hyper-parameters by using random
tuning. This will give us multiple values generated across a wide range. Here we are
creating 20 random parameter values. The example is inspired by an article by Max
Kuhn, director at Pfizer, on the Revolutions blog. The article can be accessed at
http://blog.revolutionanalytics.com/2016/06/bayesian-optimization-of-machine-learning-models.html.


setwd("C:/Personal/Machine Learning/Chapter 8/")

library(caret)
library(randomForest)
library(class)
library(ipred)
library(GPfit)

# Load Dataset
House_price <- read.csv("House Sale Price Dataset.csv", header=TRUE)

dataset <- na.omit(House_price)

# Fix the seed so that the random search is reproducible
set.seed(917)

rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5, search = "random")
rand_search <- train(HousePrice ~ StoreArea + BasementArea + SellingYear +
                       SaleType + ConstructionYear + Rating, data = dataset,
                     method = "svmRadial",
                     # Create 20 random parameter values
                     tuneLength = 20,
                     metric = "RMSE",
                     preProc = c("center", "scale"),
                     trControl = rand_ctrl)

rand_search


Support Vector Machines with Radial Basis Function Kernel 


1069 samples 
6 predictor 


Pre-processing: centered (10), scaled (10) 

Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 961, 962, 963, 962, 963, 962, ... 
Resampling results across tuning parameters: 


Sigma C RMSE Rsquared 
0.005245534 22.6530619 43909.17 0.7456410 
0.013918538 0.9927528 42284.81 0.7655819 
0.730177279 90.8484676 57009.90 0.5687722 
1.858138939 0.5329669 63431.60 0.4909382 


RMSE was used to select the optimal model using the smallest value. The final values 
used for the model were sigma = 0.04674319 and C = 3.112494. 


ggplot(rand_search) + scale_x_log10() + scale_y_log10()


Figure 8-19. RMSE in cost and Sigma space


getTrainPerf(rand_search)


TrainRMSE TrainRsquared method 
1 41480.77 0.7706348 svmRadial 


The Bayesian optimization in this example uses a model based on
Gaussian processes to predict good tuning parameters. In other words, a regression type of
framework, with RMSE modeled as a function of the tuning parameters, is used for this Bayesian analysis.

For any combination of cost and sigma, we can calculate the predicted RMSE together with
bounds on that prediction. This uncertainty in the prediction is what makes it possible to find a better direction for
optimization. 
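A minimal base R sketch of this idea follows. It fits a tiny Gaussian process to a few made-up observations in one dimension and picks the next point to try with an upper confidence bound (UCB), which is the acquisition rule selected by acq = "ucb" later in this section. The kernel, its length scale, and all numbers are assumptions for illustration; this is not the code inside rBayesianOptimization or GPfit.

```r
# x stands for one tuning parameter on the log scale, y for the score to
# maximize (for example, negative RMSE). All values are made up.
x_obs <- c(-4, -1, 2, 5)
y_obs <- c(-3.0, -1.2, -0.8, -2.5)

# Squared-exponential (RBF) kernel with an assumed length scale.
rbf <- function(a, b, length_scale = 2) {
  exp(-outer(a, b, "-")^2 / (2 * length_scale^2))
}

# Posterior mean and standard deviation of the Gaussian process at x_new.
gp_posterior <- function(x_new, x_obs, y_obs, noise = 1e-6) {
  K     <- rbf(x_obs, x_obs) + diag(noise, length(x_obs))
  K_s   <- rbf(x_new, x_obs)
  K_inv <- solve(K)
  mu <- mean(y_obs) + as.vector(K_s %*% K_inv %*% (y_obs - mean(y_obs)))
  v  <- 1 - rowSums((K_s %*% K_inv) * K_s)   # prior variance is 1 for this kernel
  list(mean = mu, sd = sqrt(pmax(v, 0)))
}

grid  <- seq(-5, 6, by = 0.1)
post  <- gp_posterior(grid, x_obs, y_obs)
kappa <- 1                                   # weight on uncertainty (exploration)
ucb   <- post$mean + kappa * post$sd         # upper confidence bound
next_x <- grid[which.max(ucb)]
print(next_x)
```

A larger kappa favors points where the model is uncertain (exploration), while a smaller kappa favors points where the predicted score is already high (exploitation).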


# Define the resampling method 
ctrl <- trainControl(method = "repeatedcv", repeats = 5)


Use this function to optimize the model. The two parameters are evaluated on the 
log scale given their range and scope. 


svm_fit_bayes <- function(logC, logSigma) {
  # Use the same model code but for a single (C, sigma) pair.
  txt <- capture.output(
    mod <- train(HousePrice ~ StoreArea + BasementArea + SellingYear + SaleType +
                   ConstructionYear + Rating, data = dataset,
                 method = "svmRadial",
                 preProc = c("center", "scale"),
                 metric = "RMSE",
                 trControl = ctrl,
                 tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
  )
  # The function wants to maximize the outcome so we return
  # the negative of the resampled RMSE value. "Pred" can be used
  # to return predicted values but we'll avoid that and use zero.
  list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
}


Define the bounds of the search. 


lower_bounds <- c(logC = -5, logSigma = -9)
upper_bounds <- c(logC = 20, logSigma = -0.75)
bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
               logSigma = c(lower_bounds[2], upper_bounds[2]))

# Create a grid of values as the input into the BO code
initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
initial_grid$C <- log(initial_grid$C)
initial_grid$sigma <- log(initial_grid$sigma)
initial_grid$RMSE <- -initial_grid$RMSE
names(initial_grid) <- c("logC", "logSigma", "Value")

# Run the optimization with the initial grid and with 30 iterations.
library(rBayesianOptimization)

set.seed(917)
ba_search <- BayesianOptimization(svm_fit_bayes,
                                  bounds = bounds,
                                  init_grid_dt = initial_grid,
                                  init_points = 0,
                                  n_iter = 30,
                                  acq = "ucb",
                                  kappa = 1,
                                  eps = 0.0,
                                  verbose = TRUE)


20 points in hyperparameter space were pre-sampled
elapsed = 7.02 Round = 21 logC = -0.6296 logSigma = -3.2325 Value = 4.260364e+04

Best Parameters Found:
Round = 43 logC = 3.5271 logSigma = -3.3272 Value = -4.106852e+04


ba_search


$Best_Par
     logC   logSigma
 3.527062  -3.327152

$Best_Value
[1] -41068.52

$History
    Round          logC    logSigma      Value
 1:     1   3.120295026  -5.2503783  -43909.17
 2:     2  -0.007273577  -4.2745337  -42284.81
49:    49   1.765610990  -2.6130250  -41510.91
50:    50   3.286583098  -3.4811229  -41876.16
    Round          logC    logSigma      Value

$Pred
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1:  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0
   V21 V22 V23 V24 V25 V26 V27 V28 V29 V30
1:   0   0   0   0   0   0   0   0   0   0


The best values are found as follows: 


Round = 43 

logC = 3.5271 
logSigma = -3.3272 
Value = -4.106852e+04


Let's now develop a model with these parameters to see if the optimization
actually worked.


final_search <- train(HousePrice ~ StoreArea + BasementArea + SellingYear +
                        SaleType + ConstructionYear + Rating, data = dataset,
                      method = "svmRadial",
                      tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
                                            sigma = exp(ba_search$Best_Par["logSigma"])),
                      metric = "RMSE",
                      preProc = c("center", "scale"),
                      trControl = ctrl)

final_search
Support Vector Machines with Radial Basis Function Kernel 


1069 samples 
6 predictor 


Pre-processing: centered (10), scaled (10) 

Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 962, 961, 964, 961, 964, 963, ... 
Resampling results: 


RMSE Rsquared 
41595.45 0.7671211 


Tuning parameter ‘sigma’ was held constant at a value of 0.0358952 


Tuning parameter 'C' was held constant at a value of 34.02386 


The following command compares the two models. The
comparison uses a one-sample t-test on the differences in resampled RMSE between the models.


compare_models(final_search, rand_search)


One Sample t-test 


data: x 
t = 0.061836, df = 49, p-value = 0.9509 
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval: 
-3612.507 3841.883 
sample estimates: 
mean of x 
114.6878


With a p-value of 0.95, the test finds no significant difference between the two models: the model
fit on the new configuration is comparable to the random search in terms of the
resampled RMSE. The optimization therefore reached a similarly good region of the parameter
space, although it did not clearly improve on random search for this data.
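The one-sample t-test above has 49 degrees of freedom because it is applied to the 50 per-resample differences between the two models (10 folds repeated 5 times). A base R sketch on made-up RMSE values shows that this is the same as a paired t-test; the numbers below are simulated and are not the resamples from the models above.

```r
set.seed(917)
rmse_a <- rnorm(50, mean = 41600, sd = 9000)          # made-up resampled RMSE, model A
rmse_b <- rmse_a + rnorm(50, mean = 0, sd = 13000)    # model B on the same 50 resamples

diffs      <- rmse_a - rmse_b
one_sample <- t.test(diffs, mu = 0)                   # test on the differences
paired     <- t.test(rmse_a, rmse_b, paired = TRUE)   # equivalent paired test

print(one_sample$parameter)   # degrees of freedom
print(c(one_sample = one_sample$p.value, paired = paired$p.value))
```

The pairing matters: both models are evaluated on the same resamples, so the fold-to-fold variation in difficulty cancels out in the differences.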


8.9 Summary 


Machine learning models are more complicated than traditional statistical models, and
ensembles increase that complexity further. The models have become difficult to explain,
and it is far more difficult to separate out the component-wise contribution of each feature.
An ensemble model adds further to the complexity of the relationships between the dependent
variable and the predictor variables quantified by the machine learning model. On the other
hand, machine learning algorithms make it possible to use large volumes of data with far
fewer assumptions about the data. This makes machine learning stand apart from statistical
learning and opens up a bag of opportunities to model virtually any data problem.

One of the major contrasts between statistical learning and machine learning is
the way the models learn from the given dataset. Machine learning algorithms
are iterative in nature and depend on some "high-level parameters," which define the
complexity of the model, the learning rate, and so on. These parameters are commonly known
as hyper-parameters. Hyper-parameters affect model performance to a large extent, as they
define how the model should learn from the data. We
learned some methods to optimize these hyper-parameters. All the model fitting and
optimization in this chapter is done using a very powerful package in R named caret, which stands
for classification and regression training. It gives access to more than 230 models through a
single function call.

In this chapter, after introducing various types of hyper-parameter optimization
methods, we introduced the very important topic of the bias and variance tradeoff. This
tradeoff applies to all statistical models and lies at the heart of
any model performance optimization problem. The tradeoff states that, in general, you cannot
decrease bias and variance simultaneously. Ensemble methods were then introduced
to create models that reduce bias (boosting ensembles) or reduce variance (bagging
ensembles). Bagging and boosting are both powerful techniques, and they have become
very popular in recent years.

This chapter also illustrated four very popular ensemble examples using R code:
bagging, boosting, blending, and stacking. The results were compared, and issues like
correlation in results were also discussed. In the end, we introduced an advanced
technique for hyper-parameter optimization called Bayesian optimization. This is a hot
topic of research, as machine learning models have become so large that a grid search
is often not a feasible solution for hyper-parameter optimization.

In recent times, the machine learning methods have become computationally 
demanding as well. You can sense the enormity of computational power we require by 
noting the fact that for this chapter we just used a sample of data. Adding more data and 
expanding the search grids can enhance the results. To be able to cater to the demand 
of machine learning algorithms, both with respect to volume of data and computational 
power, we need to explore scalable machine learning infrastructure and algorithms. 

The next chapter introduces the scalable solutions available to practitioners. We 
will introduce concepts of distributed file systems, cluster model training, using Spark, 
parallelization of algorithms, and other issues in scaling up machine learning algorithms.


CHAPTER 9


Scalable Machine Learning
and Related Technologies





A few years back, you would not have heard the word "scalable" in machine learning
parlance. The reason was mainly the lack of infrastructure, data, and
real-world applications. Machine learning was mostly talked about in the academic research
community or in well-funded industry research labs. A prototype of any real-world
application using machine learning was considered a big feat and a demonstration
of breakthrough research. However, times have changed with the availability of
powerful commodity hardware at a reduced cost and the widespread adoption of big data
technology. As a result, data has become easily accessible and software development
is becoming more and more data savvy. Every single byte of data is being captured even
if its use is not clear in the near future.

As you witnessed in Chapter 6, the machine learning algorithm has a lot of statistical 
and mathematical depth, but that's not sufficient for it to become scalable. The veracity of 
such statistical techniques is only enough to work on a small dataset that wholly resides 
in one machine. However, when the data size grows big enough to challenge the storage 
capabilities of a single machine, the world of distributed computing and algorithmic 
complexities starts to take over. And in this world, questions like the following start to 
emerge: 


• Does the algorithm run in linear or quadratic time?
• Do we have a distributed (parallel) version of the algorithm?
• Do we have enough machines with the required storage and
computing power?


If the answers to these questions are yes, you are ready to think big. A very recent 
notion of building data products, which we emphasized in our PEBE machine learning 
process flow, originates from our ability to scale things that can cater to the demand of 
ever-changing technology, data, and increasing number of users of the product. We are 
continuously learning from the incremental addition of new data. 

In this concluding chapter of the book, we will take you through the exciting journey 
of big data technologies like Apache Hadoop, Hive, Pig, and Spark, with special focus 
on scalable machine learning using real-world examples. We will be presenting an 
introduction to these technologies.


9.1 Distributed Processing and Storage 


Imagine a program that uses the most optimized algorithm with the best running time
(time complexity) and that is designed for efficient storage as well. The notion of
best running time, however, depends on the application: for a company like Google it is
a few microseconds or even less (for its search program), while a company involved in DNA
sequencing might be willing to wait a few days or weeks for a program to complete.
Before the big data revolution started, parallel and distributed computing
was solving the problem of execution time. The same
programs were ported to run on multiple machines (servers) at the same time. In other
words, the program was divided into many subtasks and assigned to multiple machines
executing them at the same time. The paradigm shift that big data brought to this way of distributed
computing was to design a mechanism that efficiently divides the data as well as the
program that processes it. The types of problems people thought about in the distributed
computing era and in the big data generation have also seen quite a big makeover. For
example, the vertex graph coloring problem (finding a way to color the
vertices of a graph so that no two adjacent vertices share the same color) is considered a
computationally challenging task even for a small graph with a few vertices. There is a lot of
literature available on designing such distributed programs, like the one described in the
references at the end of the chapter.

On the other hand, when enormous volumes of data are involved, for example,
sorting an array of a billion numbers, big data technologies provide
the solution. Our focus in this chapter is to highlight some of the technologies in
this domain using a real-world example.

Although the evolution of distributed and parallel computing began many decades 
ago, its widespread use has been made possible by two breakthrough works, which led 
to an entire application development and further state-of-the-art technologies. The first 
such breakthrough came from Google in 2003, with their "Google File System" followed by 
"MapReduce: Simplified Data Processing on Large Clusters" in 2004. The former provided 
a scalable distributed file system for large distributed data-intensive applications and the 
latter designed a programming model and an associated implementation for processing 
and generating large datasets. Together, they provide an architecture for dividing and storing
large amounts of data in smaller chunks across thousands of machines (nodes) and for moving
the computation to the machines that hold those chunks, rather than running it on the entire
dataset in one place.

The second breakthrough, which took this technology to the masses, was in 2006, 
with Apache Hadoop, a complete open source framework for distributed storage and 
processing. Hadoop successfully demonstrated that, by using large computer clusters 
built from commodity hardware, it’s possible to achieve reduced computation time and 
automatically handle hardware failures. 


9.1.1 Google File System (GFS) 


GFS was designed with the demands of data-intensive applications in mind.
It provided a scalable distributed file system (for storage) for large data.
To emphasize the need for such a file system, the following is an
excerpt from the paper, The Google File System:


While sharing many of the same goals as previous distributed file systems, 
our design has been driven by observations of our application workloads 
and technological environment, both current and anticipated that reflect 
a marked departure from some earlier file system assumptions. This has 
led us to reexamine traditional choices and explore radically different 
design points. 


The file system has successfully met our storage needs. It is widely deployed 
within Google as the storage platform for the generation and processing 
of data used by our service as well as research and development efforts 
that require large datasets. The largest cluster to date provides hundreds 
of terabytes of storage across thousands of disks on over a thousand 
machines, and it is concurrently accessed by hundreds of clients. 


Handling terabytes of data using thousands of disks on over a thousand machines
speaks to the humongous task such systems are designed to handle.

As shown in Figure 9-1, which was originally published in the paper "The Google File
System," a GFS master stores the metadata about every data chunk stored on the GFS chunk
servers.


Figure 9-1. A Google file system


The metadata contains the file and chunk namespaces (a namespace is an abstract container
holding unique names or identifiers), the file-to-chunk mappings, and the location of each chunk's
replicas for fault tolerance. In the initial design, there was only a single master; however,
most contemporary distributed architectures have much more complex settings even
around the master. The GFS client interacts with the master for metadata requests and all
data requests go to the chunk servers. 
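As a rough illustration of the split between metadata and data requests, the following base R sketch models the master's metadata as two lookup tables. The file name, chunk identifiers, and server names are all made up; the 64 MB chunk size is the one described in the GFS paper.

```r
# Master metadata: file-to-chunk mappings and the replica locations of each chunk.
file_to_chunks <- list("/logs/web.log" = c("chunk-001", "chunk-002"))
chunk_replicas <- list("chunk-001" = c("cs-a", "cs-b", "cs-c"),
                       "chunk-002" = c("cs-b", "cs-c", "cs-d"))
chunk_size <- 64 * 1024^2   # 64 MB chunks

# Metadata request: the client asks the master which chunk holds a byte offset
# and where its replicas live. The data itself is then read from a chunk server.
locate <- function(path, offset) {
  index <- offset %/% chunk_size + 1
  chunk <- file_to_chunks[[path]][index]
  list(chunk = chunk, servers = chunk_replicas[[chunk]])
}

hit <- locate("/logs/web.log", offset = 70 * 1024^2)
print(hit)
```

Only this small lookup goes through the master; the bulk of the traffic flows directly between the client and the chunk servers, which keeps the single master from becoming a bottleneck.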
9.1.2 MapReduce


Distributed processing using MapReduce is at the core of how a task on a big
dataset is divided according to the distributed storage. MapReduce was designed as a
programming model for applying a certain logic, which could range from a sorting operation
to running a machine learning algorithm, on a large volume of data.

In a nutshell, as the paper "MapReduce: Simplified Data Processing on Large Clusters"
explains, users specify a map function that processes a key-value pair to generate a set of
intermediate key-value pairs, and a reduce function that merges all intermediate values
associated with the same intermediate key. In simpler terms, you break the data into smaller
chunks and write a map function to process a <key, value> pair from each of the smaller
chunks simultaneously on different nodes. This in turn generates intermediate <key,
value> pairs, which travel over the network to a node where they are merged by the logic
defined in the reduce function. The combination of these two is called the MapReduce program.
We will see a simple example of this in the Hadoop ecosystem section. 
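
To make the map, group, and reduce steps concrete, here is a minimal single-machine sketch in base R. It is only an illustration of the programming model, not of Hadoop itself: the temperature records and the names map_fn and reduce_fn are made up for this example, and the "nodes" are simulated by a list of chunks.

```r
# Illustrative single-machine simulation of the MapReduce model (base R only).
# The records and the names map_fn/reduce_fn are hypothetical.

# Input already broken into chunks, as if each lived on a different node.
# Each record is "year,temperature".
chunks <- list(
  c("2014,21", "2015,30", "2014,25"),
  c("2015,28", "2016,19"),
  c("2016,33", "2014,18")
)

# Map: turn each record of a chunk into an intermediate <key, value> pair
map_fn <- function(chunk) {
  parts <- strsplit(chunk, ",", fixed = TRUE)
  data.frame(key   = vapply(parts, `[`, character(1), 1),
             value = as.numeric(vapply(parts, `[`, character(1), 2)),
             stringsAsFactors = FALSE)
}

# Reduce: merge all values that share one key (here: keep the maximum)
reduce_fn <- function(key, values) max(values)

# Run the map on every chunk (on a cluster these calls run on different nodes)
intermediate <- do.call(rbind, lapply(chunks, map_fn))

# Group the intermediate values by key, then reduce each group
grouped <- split(intermediate$value, intermediate$key)
result  <- mapply(reduce_fn, names(grouped), grouped)
print(result)
```

The map step never looks at more than one chunk and the reduce step never looks at more than one key, which is what lets a real framework run both steps in parallel.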

Figure 9-2, from the classic paper "MapReduce: Simplified Data Processing on Large
Clusters," shows how the input file, split into smaller chunks, is placed on the workers
(chunk servers) where the map program executes.

Figure 9-2. MapReduce execution flow


Once the map phase has completed its assigned task, it writes the data back to the
local disk on the chunk servers, from where it is picked up by the reduce program to finally
output the results. This entire process executes seamlessly even if there are hardware
failures. We will explain MapReduce in greater detail in the next section.

9.1.3 Parallel Execution in R


In the CRAN documentation titled “Getting Started with doParallel and foreach,”
Steve Weston and Rich Calaway, the creators of the doParallel package, explain:
“The doParallel package is a ‘parallel backend’ for the foreach package. It provides a
mechanism needed to execute foreach loops in parallel. The foreach package must be used
in conjunction with a package such as doParallel in order to execute code in parallel.”
Foreach is an idiom that allows for iterating over elements in a collection, without the use
of an explicit loop counter.

Before we go into some examples of MapReduce and discuss the Hadoop ecosystem,
let’s see how to run the random forest algorithm (explained in Chapter 6)
in parallel on the multiple CPU cores of a single machine. We will use the credit
score dataset.


9.1.3.1 Setting the Cores 


Using the doParallel library in R, you can set the number of CPU cores that
your machine should use while running the model. There are algorithmic
ways to decide (beyond the scope of this book) how many cores you should be using
if a dedicated machine for such processing is available. However, if it’s your personal
machine, don’t overload the system by using too many cores. Keep in mind that assigning all
the cores to this process could crash your other processes due to insufficient resources. To
be safe, we used c-2 cores, where c is the number of cores available in your machine.


library(doParallel)

# Find out how many cores are available (if you don't already know)
c = detectCores()
c
[1] 4
# Find out how many cores are currently being used
getDoParWorkers()
[1] 1
# Create cluster with c-2 cores
cl <- makeCluster(c-2)

# Register cluster
registerDoParallel(cl)

# Find out how many cores are being used
getDoParWorkers()
[1] 2
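
On a machine with only one or two cores, c-2 is zero or negative and would not give a usable cluster. The following is a small defensive sketch, assuming you want to keep the "leave two cores free" rule; the helper name safe_workers is hypothetical and not part of doParallel.

```r
library(parallel)  # ships with R; provides detectCores()

# Hypothetical helper: leave `reserve` cores free, but always return at least 1
safe_workers <- function(reserve = 2L) {
  available <- detectCores()
  if (is.na(available)) return(1L)   # detectCores() can return NA
  max(1L, available - reserve)
}

n_workers <- safe_workers()
n_workers
```

The result could then be passed to makeCluster() in place of c-2.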
9.1.3.2 Problem Statement


Using the data here, we build a random forest model that predicts whether or not a customer
will default on repaying a bank loan (a binary classifier). For
this demonstration, we are only interested in the time it takes to build the model when
executed serially versus in parallel.


# Problem: Identifying Risky Bank Loans
setwd("C:\\Users\\Karthik\\Dropbox\\Book Writing - Drafts\\Chapter Drafts\\Chapter 9 - Scalable Machine Learning and related technology\\Datasets")
# stringsAsFactors = TRUE keeps text columns as factors on R 4.0 and later
credit <- read.csv("credit.csv", stringsAsFactors = TRUE)


str(credit)
'data.frame':   1000 obs. of  17 variables:
 $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
 $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
 $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
 $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
 $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
 $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
 $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
 $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...


# create a random sample for training and test data
# use set.seed to use the same random number sequence as the tutorial
set.seed(123)
train_sample <- sample(1000, 900)
str(train_sample)
int [1:900] 288 788 409 881 937 46 525 887 548 453 ...
# split the data frames
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]


9.1.3.3 Building the Model: Serial


Note the time it takes to execute the random forest model in a serial fashion on the 
training data created. 


# Training a model on the data
library(randomForest)

# Sequential Execution
system.time(rf_credit_model <- randomForest(credit_train[-17],
                                            credit_train$default,
                                            ntree = 1000))
   user  system elapsed
    1.8     0.0     1.8


9.1.3.4 Building the Model: Parallel 


In the parallel version of the code, instead of calling randomForest directly with
ntree = 1000 (which builds 1000 decision trees), we use the
foreach function with %dopar% to split the building of the 1000 trees
into four tasks. Each task builds 250 decision trees using the randomForest function, and
the combine function of the randomForest package merges the four forests into one.


# Parallel Execution
system.time(
  rf_credit_model_parallel <- foreach(nt = rep(250, 4),
                                      .combine = combine,
                                      .packages = 'randomForest') %dopar%
    randomForest(credit_train[-17],
                 credit_train$default,
                 ntree = nt))

   user  system elapsed
   0.33    0.09    1.73
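
The vector rep(250, 4) works because 1000 divides evenly by four. As a hedged sketch of how the split could be computed for any number of tasks, the hypothetical helper split_trees below (not part of foreach or randomForest) hands out the leftover trees one at a time:

```r
# Hypothetical helper: divide `total` trees into `parts` near-equal batches
split_trees <- function(total, parts) {
  base  <- total %/% parts          # trees every batch gets
  extra <- total %% parts           # leftover trees, handed out one by one
  base + as.integer(seq_len(parts) <= extra)
}

split_trees(1000, 4)   # four equal batches, the same as rep(250, 4)
split_trees(1000, 3)   # three batches that still add up to 1000
```

Its result could be supplied as the nt argument of the foreach call shown above.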
9.1.3.5 Stopping the Clusters


Stop the cluster and resume execution in a serial fashion.


# Shutting down cluster - when you're done, be sure to close the parallel backend using
stopCluster(cl)


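Observe that the user time charged to the calling process is roughly 80% lower in the
parallel run (0.33 versus 1.8 seconds), because much of the tree building is handed off to
the worker processes. The elapsed (wall-clock) time, however, improves only slightly here
(1.73 versus 1.8 seconds): on a dataset this small, the cost of shipping data to the workers
and combining the results offsets most of the gain, and the numbers will differ on your
system. The benefit grows with the size of the job, so imagine the time and resources
you'd save when using a large computing cluster.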

Notes:


• The “user time” is the CPU time charged for the execution of user
instructions of the calling process.


• The “system time” is the CPU time charged for execution by the
system on behalf of the calling process.


• The “elapsed time” is the wall-clock time that passed between the
start and the end of the call.
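
The comparison between the two runs can be checked directly from the two system.time() outputs. The short sketch below only redoes that arithmetic; the numbers are copied from the outputs above and will differ on your machine.

```r
# Timings copied from the two system.time() outputs above (seconds)
serial   <- c(user = 1.80, elapsed = 1.80)
parallel <- c(user = 0.33, elapsed = 1.73)

# Percentage reduction relative to the serial run
reduction <- round(100 * (1 - parallel / serial), 1)
reduction
```

Comparing the user and elapsed entries side by side shows how much of the saving is CPU time moved off the calling process and how much is an actual reduction in waiting time.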


In the next section, we go a little deeper into the Hadoop ecosystem and demonstrate 
the first “hello world” example using Hadoop and R. 


9.2 The Hadoop Ecosystem 


There are plenty of resources on Hadoop due to its popularity. Taking a broad view, the
Hadoop framework consists of the following three modules (the technical details of the
framework are beyond the scope of this book):


• Hadoop Distributed File System: This is the storage part of
Hadoop; the core where the data chunks really reside. Dividing
data into smaller segments means you need a meticulous way of
storing the references in the form of metadata and making them
available to all the processes requiring it.


• Hadoop YARN: Yet Another Resource Negotiator, also
known as the data operating system. Starting with Hadoop
2.0, YARN has become the core engine driving the processes
efficiently through a prudent resource management framework.


• Hadoop MapReduce: MapReduce decides the execution logic
of what needs to be done with the data. The logic should be
designed in such a way that it can execute in parallel on smaller
chunks of data residing in a distributed cluster of machines.


On top of this, there are many additional software packages specially designed
to work on the Hadoop framework, namely Apache Pig, Hive, HBase, Mahout, Spark,
Sqoop, Flume, Oozie, Storm, Solr, and more. All this software is necessary because of the
paradigm shift Hadoop brought to the traditional scheme of relational and small-scale
data. We will take a brief look at Apache Pig, Hive, HBase, and Spark in this chapter, as
they are among the main pillars of the Hadoop ecosystem. Figure 9-3 shows these tools
organized in the Hadoop ecosystem.


[Figure: the Hadoop stack, with Ambari (provisioning, managing, and monitoring Hadoop
clusters) at the top and the Hadoop Distributed File System (HDFS) at the base]





Figure 9-3. Hadoop components and tools 


We first discuss MapReduce, the processing layer of Hadoop, which since Hadoop 2.0
runs on top of YARN.


9.2.1 MapReduce 


MapReduce is a programming model for designing parallel and distributed algorithms
on a cluster of machines. At a broad level, it consists of two procedures. Map
performs operations like filtering and sorting; it processes a key-value pair and
generates intermediate key-value pairs. Reduce merges all the intermediate values
with the same key. If a problem can be expressed this way, then it’s possible to use
MapReduce to break the problem into smaller parts. Over the years, this model has been
successfully used in many real-world problems. In order to understand this model, let’s
look at a simple example of word count.


9.2.1.1 MapReduce Example: Word Count 


Imagine there is a news aggregator application trying to build an automatic topic
generator for all its articles on the web. The first step in the topic generator algorithm
is to build a bag of words with their frequencies or, in other words, count the number of
occurrences of each word in an article. Since there are an enormous number of articles
on the web, it requires huge computational power to build this topic
generator.

Figure 9-4 shows the MapReduce execution flow: the article is split into many
pieces, each processed by the Map function, which generates intermediate key-value
pairs of a word and a value of 1. Another process called shuffle moves the output of map to
the Reducer, where finally the values are added up for each word.


[Figure: the stages Input, Split, Map, Shuffle, Reduce, and Output applied to the input
lines "Hi All Welcome to Hadoop" and "Hadoop class integrating with R Hadoop"]





Figure 9-4. Word count example using MapReduce 


Notes:

• The example needs a Linux/UNIX machine to run.
• Appropriate system paths need to be defined by the administrators.
• Here is the system information in which the code was executed.
    a. platform: i686-redhat-linux-gnu
    b. arch: i686
    c. os: linux-gnu
    d. system: i686, linux-gnu
    e. major: 3
    f. minor: 1.2
    g. year: 2014
    h. month: 10
    i. day: 31
    j. svn rev: 66913
    k. language: R
    l. version.string: R version 3.1.2 (2014-10-31)
• The appropriate Hadoop version is required to run the code. This
code runs on Hadoop version 2.2.0, build 1529768. Compatibility
of this code with the latest version of Hadoop has not been checked.


You must set the environment variables with the location of the Hadoop executable (in
the Hadoop bin folder) and the Hadoop streaming JAR.


Sys.setenv(HADOOP_CMD="/usr/lib/hadoop-2.2.0/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")


Then you install and load the libraries rmr2 and rhdfs. Once they load successfully, you
initialize HDFS so that you can read data from it or write data to it.


library(rmr2)
library(rhdfs)


# Hadoop File Operations


# Initialize HDFS
hdfs.init()


Then you put some sample data into HDFS using the hdfs.put() function in the rhdfs
library.


# Put File into HDFS
hdfs.put("/home/sample.txt","/hadoop_practice")
[1] TRUE


Then you define the Map and Reduce functions. This code snippet defines the way the
Map and Reduce functions scan the text file and tokenize it (a term generally
given to splitting a given sentence or document by a separator like a space) into key-value pairs
for counting.


# Reads a bunch of lines at a time
# Map Phase
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s+')
  words <- unlist(words.list)
  return(keyval(words, 1))
}

# Reduce Phase
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

# MapReduce Function
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
}

Then you run the wordcount. The wordcount function we defined is now ready to
be executed on Hadoop. Before calling the function, ensure that you have set the base
directory where the input file exists and where you want to put the output generated by
the wordcount function:


# read text files from folder input on HDFS
# save result in folder output on HDFS
# Submit job

basedir <- '/hadoop_practice'
infile <- file.path(basedir, 'sample.txt')
outfile <- file.path(basedir, 'output')
ret <- wordcount(infile, outfile)


Fetch the results. Once the execution of the wordcount function is complete, you
can fetch the results back into R, convert them into a data frame, and sort the results, as
shown in this code snippet.


# Fetch results from HDFS
result <- from.dfs(outfile)
results.df <- as.data.frame(result, stringsAsFactors = FALSE)
colnames(results.df) <- c('word', 'count')
tail(results.df, 100)

         word count
1           R     1
2          Hi     1
3          to     1
4         All     1
5        with     1
6       class     1
7      hadoop     3
8     Welcome     1
9 integrating     1

head(results.df[order(results.df$count, decreasing = TRUE), ])

    word count
7 hadoop     3
1      R     1
2     Hi     1
3     to     1
4    All     1
5   with     1
Since the entire book is written in R, we have presented this example of word count
where R integrates with Hadoop through Hadoop streaming, which is wrapped by the
packages rhdfs and rmr2. For demonstration's sake, such an integration is fine,
but in a real production system, it might not be a robust solution. Other programming
languages like Java, Scala, and Python have more mature, production-level tooling and a
tighter coupling with the Hadoop framework. In the coming sections, we will introduce the
basics of Hive, Pig, and HBase, and conclude with a real-world example using Spark.


9.2.2 Hive 


The most critical requirement in adapting to a big data technology
like Hadoop was the ability to read, write, and transform data in the way one is used to in
Relational Database Management Systems (RDBMS) using SQL (Structured Query
Language). An RDBMS has a well-structured design of tables grouped into databases, which
follow a predefined schema. Querying any table is easy if you follow the SQL syntax, logic,
and schema properly. The databases are well managed in a data warehouse.

Now, in order to facilitate such ease of querying the data stored in HDFS, there was
a need for a data warehouse tool that’s strongly coupled with HDFS and, at the same
time, provides the same querying capabilities as a traditional RDBMS. Apache Hive
was developed keeping this thought at the center of its design principles. Although the
underlying storage is HDFS, the data can be structured in a well-defined schema. Among
the tools in the Hadoop ecosystem, Hive is one of the most widely used components across
the industry. An advanced technical discussion of the Hive architecture and design is
beyond the scope of this book; however, we will present introductory material here in
order for you to connect with the larger scheme of things when it comes to big data.

There are many tools in the market that help with large-scale processing of data from
various sources in a company and put it into a common data platform (Hive is typically
one of the data processing engines in such platforms), which is then made available across
the company to analysts, product managers, developers, operations analysts, and so on.
Qubole Data Service is one platform offering such a processing service. It also
provides a GUI for writing SQL queries that run on Hive.

Notes:


• In the following demonstrations, we used a Linux virtual
machine from Cloudera. However, if you have an instance of
Linux OS installed on your personal system, you can follow the
link https://cwiki.apache.org/confluence/display/Hive/GettingStarted
to set up Hive.


• Alternatively, you could download a virtual machine (also
called a Sandbox) from Cloudera, Hortonworks, or MapR. These
virtual machines are prebuilt with all the necessary tools
and components of Hadoop to get you started quickly. Here
are a couple of options. Hortonworks VM:
http://hortonworks.com/products/sandbox/ and Cloudera VM:
http://www.cloudera.com/downloads/quickstart_vms/5-8.html. For the
demonstrations in this chapter on Pig, Hive, and HBase, we used a VM
from Cloudera.


• We will use the native command-line interface to show some
basics of Hive.


9.2.2.1 Creating Tables 


The create table query shown in Figure 9-5 looks very similar to a traditional SQL query;
however, what happens in the background is quite different in Hive. Upon successful
execution of this query, a new directory for the table is created in HDFS under the default
database of the Hive warehouse.


hive> create table emp_info(id int, name string, desig string, sal int, dep string)
    > row format delimited
    > fields terminated by ',';
OK
Time taken: 0.03 seconds
hive>


Figure 9-5. The Hive create table command 


Figure 9-6 shows the emp_info table in the folder structure /user/hive/warehouse/
of HDFS.


[Screenshot: HDFS web browser showing the contents of directory
/user/hive/warehouse/emp_info]


Figure 9-6. Hive table in HDFS 


9.2.2.2 Describing Tables 


Once the table is created, you can use the describe formatted emp_info; command
to check that the structure of the table matches the one we used during creation. Along with
the column name, it also shows the data type of each column (see Figure 9-7).

hive> describe formatted emp_info;
OK
# col_name      data_type       comment
id              int             None
name            string          None
desig           string          None
sal             int             None
dep             string          None


Figure 9-7. The describe table command 


9.2.2.3 Generating Data and Storing it in a Local File 


The table is now ready to be loaded with some data. Figure 9-8 shows how we
generate some dummy data and store it in the local directory in a file named
emp_info.


[training@localhost ~]$ cat > emp_info
1, Anne, Admin,50000,A
2, Gokul,Admin,50000,B
3, Janet,Sales,60000,C
4, Karthik,Analytics, 4454,A
5, Arun, Some,3243423,B
6, Nitin, Dev,3232,C
[training@localhost ~]$ cat emp_info
1, Anne, Admin,50000,A
2, Gokul,Admin,50000,B
3, Janet,Sales,60000,C
4, Karthik,Analytics, 4454,A
5, Arun, Some,3243423,B
[training@localhost ~]$


Figure 9-8. Generate data and store in local file 


9.2.2.4 Loading the Data into the Hive Table 


Once we have the data in a local file, the command load data local inpath
'/home/training/emp_info' into table emp_info; loads the data into the Hive
table emp_info in HDFS (see Figure 9-9).

hive> load data local inpath '/home/training/emp_info' into table emp_info;
Copying data from file:/home/training/emp_info
Copying file: file:/home/training/emp_info
Loading data to table default.emp_info
OK
Time taken: 2.456 seconds
hive>


Figure 9-9. Load the data into a Hive table 


Figure 9-10 shows the data in the HDFS file that we loaded from the local file system. 


[Screenshot: HDFS web browser showing the file /user/hive/warehouse/emp_info/emp_info,
which lists the five records for Anne, Gokul, Janet, Karthik, and Arun]


Figure 9-10. Data in the HDFS file 


9.2.2.5 Selecting a Query 


Figure 9-11 shows two varieties of the select query. The first one is without a where clause
and the second one uses where dep = 'A'. Notice how the MapReduce framework built
into Hadoop comes into play in the second query, where Hive launches a MapReduce job to
apply the filter. This is exactly why we associate
tools like Hive with the Hadoop ecosystem. The only difference from the
word count example is that we don't have to explicitly define any Map or Reduce
methods; Hive does that for us automatically.

hive> set hive.cli.print.header=true;
hive> select * from emp_info;
OK
id      name    desig   sal     dep
1       Anne    Admin   50000   A
2       Gokul   Admin   50000   B
3       Janet   Sales   60000   C
4       Karthik Analytics       NULL    A
5       Arun    Some    3243423 B
Time taken: 0.075 seconds
hive> select * from emp_info where dep = 'A';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201609180524_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201609180524_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201609180524_0001
2016-09-18 07:07:12,732 Stage-1 map = 0%, reduce = 0%
2016-09-18 07:07:13,746 Stage-1 map = 100%, reduce = 0%
2016-09-18 07:07:14,774 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201609180524_0001
OK
id      name    desig   sal     dep
1       Anne    Admin   50000   A
4       Karthik Analytics       NULL    A
Time taken: 6.38 seconds
hive>


Figure 9-11. Select query with and without a where clause 
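
In Figure 9-11, the sal value for Karthik comes back as NULL even though the file contains 4454. A plausible explanation, assuming the space before 4454 in Figure 9-8 really is in the file, is that the field " 4454" does not fit the declared int type when Hive applies the schema at read time. The base R sketch below imitates that behavior on the same records. The function strict_int is a hypothetical parser written for this illustration and has nothing to do with Hive's own code; the records are as best they can be read from the figures, with spacing simplified except in the sal field.

```r
# Records as best they can be read from Figures 9-8 and 9-10; the space
# before 4454 is assumed to be present in the actual file.
lines <- c("1,Anne,Admin,50000,A",
           "2,Gokul,Admin,50000,B",
           "3,Janet,Sales,60000,C",
           "4,Karthik,Analytics, 4454,A",
           "5,Arun,Some,3243423,B")

# Hypothetical strict parser: anything that is not purely digits becomes NA,
# imitating a NULL for a field that does not fit the declared int type
strict_int <- function(x) {
  ok  <- grepl("^-?[0-9]+$", x)
  out <- rep(NA_integer_, length(x))
  out[ok] <- as.integer(x[ok])
  out
}

# Apply the schema (id int, name string, desig string, sal int, dep string)
fields   <- do.call(rbind, strsplit(lines, ",", fixed = TRUE))
emp_info <- data.frame(id    = strict_int(fields[, 1]),
                       name  = fields[, 2],
                       desig = fields[, 3],
                       sal   = strict_int(fields[, 4]),
                       dep   = fields[, 5],
                       stringsAsFactors = FALSE)

# Equivalent of: select * from emp_info where dep = 'A';
dep_a <- emp_info[emp_info$dep == "A", ]
dep_a
```

The raw file is left untouched; the schema is only applied when the data is read, which is why a stray space surfaces as a missing value at query time rather than as an error at load time.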


Apart from these basic commands, Hive supports data partitioning, table joins,
multi-inserts, user-defined functions, and data export. These features are
comprehensive enough for analytical databases to be migrated into Hive.


9.2.3 Apache Pig 


Apache Pig is an analytical platform for large datasets. Pig programs, which are written in
Pig Latin, are compiled by Pig's infrastructure layer to produce a sequence of MapReduce
programs, thus achieving parallelism. Its strong coupling with Hadoop provides the
storage structure of HDFS and process handling by YARN.

Let’s revisit our word count example from the MapReduce section and see how we
write the same example as a series of Pig Latin commands. For detailed documentation
on Pig setup and usage, refer to http://pig.apache.org/docs/r0.16.0/start.html.


9.2.3.1 Connecting to Pig 


The command pig -x local connects to the local file system. Simply using the command
pig in the terminal will connect to HDFS. For our word count example, we will stick
with the local file system.







[training@localhost ~]$ pig -x local
2016-09-18 08:22:09,539 [main] INFO org.apache.pig.Main - Logging error messages to: /home/training/pig_1474212129538.log
2016-09-18 08:22:09,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt> cat textfile.txt
Hi All Welcome to Hadoop
Hadoop class integrating with R Hadoop
grunt>


Figure 9-12. Connecting to Pig using local file system 


9.2.3.2 Loading the Data 


The command A1 = load '/home/training/textfile.txt' as (line:chararray); will scan
the file and store each line as a character array. The dump A1 command will output the
following:


(Hi All Welcome to Hadoop)
(Hadoop class integrating with R Hadoop)





grunt> A1 = load '/home/training/textfile.txt' as (line:chararray);
grunt>


Figure 9-13. Load data into A1


9.2.3.3 Tokenizing Each Line 


The command A2 = foreach A1 generate TOKENIZE(line) as tokens; tokenizes each line
into words and stores them as a bag. The dump A2 command will output the
following:


({(Hi),(All),(Welcome),(to),(Hadoop)})
({(Hadoop),(class),(integrating),(with),(R),(Hadoop)})





grunt> A2 = foreach A1 generate TOKENIZE(line) as tokens;
grunt>


Figure 9-14. Tokenize each line 
9.2.3.4 Flattening the Tokens


The A3 = foreach A2 generate flatten(tokens) as words; command will further
break each tokenized line into individual words, one per row. The dump A3 command will
output the following:


(Hi)
(All)
(Welcome)
(to)
(Hadoop)
(Hadoop)
(class)
(integrating)
(with)
(R)
(Hadoop)





grunt> A3 = foreach A2 generate flatten(tokens) as words;
grunt>


Figure 9-15. Flattening the tokens 


9.2.3.5 Grouping the Words 


The command A4 = group A3 by words; will create key-value pairs of each word and
a bag in which the word is repeated as many times as it occurs in the tokenized list. The
dump A4 command will output the following:


(R,{(R)})
(Hi,{(Hi)})
(to,{(to)})
(All,{(All)})
(with,{(with)})
(class,{(class)})
(Hadoop,{(Hadoop),(Hadoop),(Hadoop)})
(Welcome,{(Welcome)})
(integrating,{(integrating)})


grunt> A4 = group A3 by words;
grunt>


Figure 9-16. Group words 
9.2.3.6 Counting and Sorting


The following two commands will generate key-value pairs of each word and the number
of its occurrences in the document, and subsequently sort them by count.


• A5 = foreach A4 generate group, COUNT(A3);
• A6 = order A5 by $1 desc;


The dump A6 command will output the following:


(Hadoop,3)
(R,1)
(Hi,1)
(to,1)
(All,1)
(with,1)
(class,1)
(Welcome,1)
(integrating,1)


Using Pig, many such analytical workflows involving selection, filter, join, union, 
sorting, grouping, and transformation could be created with ease on large datasets. 
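
For readers more at home in R than in Pig Latin, the relations A1 to A6 have close counterparts in base R. The sketch below is only an analogy: Pig evaluates its relations lazily and in parallel on a cluster, which this single-machine code does not imitate.

```r
# Base R counterparts of the Pig Latin relations A1 to A6 (an analogy only).
A1 <- c("Hi All Welcome to Hadoop",
        "Hadoop class integrating with R Hadoop")   # load: one element per line

A2 <- strsplit(A1, "\\s+")                # TOKENIZE: one bag of words per line
A3 <- unlist(A2)                          # flatten: one word per row
A4 <- split(A3, A3)                       # group A3 by words
A5 <- data.frame(group = names(A4),       # generate group, COUNT(A3)
                 count = lengths(A4),
                 row.names = NULL,
                 stringsAsFactors = FALSE)
A6 <- A5[order(-A5$count), ]              # order A5 by $1 desc
A6
```

As in the Pig output, the word Hadoop comes first with a count of 3; the order of the words that tie at a count of 1 may differ from Pig's.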


9.2.4 HBase 


So far we have been discussing data represented in a structured format of rows and
columns with a predefined schema which, once it’s made, is difficult to tweak for changing
requirements. In other words, though Hive offers a distributed version of an RDBMS on
large datasets, it still requires you to follow a fixed database schema and store the data in
a warehouse based on it. However, with rapidly changing data we need random, real-time
reads and writes on large distributed data. In such a scenario, the database can't be relational
anymore; it has to be what people in the big data world call NoSQL. HBase was modeled
after Google's Bigtable, a distributed storage system for structured data built on the Google
File System (GFS).

In contrast to a traditional RDBMS, which stores every row of data with all
its columns even if there are many null values and redundant data across tables due to
normalization, HBase is a column-oriented store. This means that each row of data is stored by
column family. For example, if you have an employee table with a column family called
details, you could store columns like name, age, and qualification under the column
family details. If a new column such as address is needed later, it can be added under
details in real time, without redefining the table.
9.2.4.1 Starting HBase


Start HBase using the shell script start-hbase.sh. Run the following three
commands:


1. cd /usr/lib/hbase/
2. sudo bin/start-hbase.sh
3. hbase shell


9.2.4.2 Creating the Table and Putting Data


The following commands create a table named employee with two column families called
details and salary. In the details column family, they put data under the
name and gender columns; in the salary column family, under the sal column.





[training@localhost lib]$ cd /usr/lib/hbase/
[training@localhost hbase]$ sudo bin/start-hbase.sh
Starting master, logging to /usr/lib/hbase/bin/../logs/hbase-root-master-localhost.localdomain.out
[training@localhost hbase]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4-cdh3u2, r, Thu Oct 13 20:32:26 PDT 2011

hbase(main):001:0>
Figure 9-17. Starting HBase 


1. create 'employee', 'details', 'salary'
2. put 'employee', 'e1', 'details:name', 'karthik'
3. put 'employee', 'e1', 'details:gender', 'm'
4. put 'employee', 'e1', 'salary:sal', '20000'
hbase(main):007:0> create 'employee', 'details', 'salary'
0 row(s) in 1.1810 seconds

hbase(main):008:0> put 'employee', 'e1', 'details:name', 'karthik'
0 row(s) in 0.0880 seconds

hbase(main):009:0> put 'employee', 'e1', 'salary:sal', '20000'
0 row(s) in 0.0160 seconds

hbase(main):010:0> put 'employee', 'e1', 'details:gender', 'm'
0 row(s) in 0.0110 seconds

hbase(main):011:0>


Figure 9-18. Create and put data 


9.2.4.3 Scanning the Data 


Using the command scan 'employee', you can see how the data is stored in HBase. Each
line of the output shows one cell: the row key, the column (as family:qualifier), a
timestamp, and the value.


hbase(main):011:0> scan 'employee'
ROW          COLUMN+CELL
 e1          column=details:gender, timestamp=1474220162391, value=m
 e1          column=details:name, timestamp=1474219898597, value=karthik
 e1          column=salary:sal, timestamp=1474219919434, value=20000
1 row(s) in 0.0360 seconds

hbase(main):012:0>


Figure 9-19. Scan the data 
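
The cell-by-cell layout in the scan output can be imitated with a small toy model in base R, which may help in seeing why a new column needs no schema change while a new column family does. Everything in the sketch below is hypothetical: new_table and put are plain R helpers that only mimic the shell commands, they are not an HBase client, and unlike HBase they keep a single version of each cell instead of timestamped versions. The address value is made up.

```r
# Toy model of an HBase table: a fixed set of column families plus a flat
# collection of cells (row key, family, qualifier, value).
# new_table() and put() are hypothetical helpers, not an HBase API.
new_table <- function(families) {
  list(families = families,
       cells = data.frame(row = character(), family = character(),
                          qualifier = character(), value = character(),
                          stringsAsFactors = FALSE))
}

put <- function(table, row, column, value) {
  parts <- strsplit(column, ":", fixed = TRUE)[[1]]      # "family:qualifier"
  # Column families are fixed at creation; qualifiers are not
  stopifnot(length(parts) == 2, parts[1] %in% table$families)
  old  <- table$cells
  same <- old$row == row & old$family == parts[1] & old$qualifier == parts[2]
  table$cells <- rbind(old[!same, ],                     # overwrite an existing cell
                       data.frame(row = row, family = parts[1],
                                  qualifier = parts[2], value = value,
                                  stringsAsFactors = FALSE))
  table
}

employee <- new_table(c("details", "salary"))
employee <- put(employee, "e1", "details:name",   "karthik")
employee <- put(employee, "e1", "details:gender", "m")
employee <- put(employee, "e1", "salary:sal",     "20000")
# A new qualifier needs no schema change, only an existing column family
employee <- put(employee, "e1", "details:address", "unknown")

employee$cells   # one line per cell, much like the output of scan
```

A put into a family that was not declared when the table was created, for example "bonus:amount", stops with an error in this model.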


A comprehensive reference guide on HBase can be found at
http://hbase.apache.org/book.html#arch.overview.


9.2.5 Spark 


Spark provides lightning-fast cluster computing (similar to distributed computing with
multiple nodes working together). Spark has an advanced Directed Acyclic Graph (DAG)
based execution engine which, according to the Spark project, makes it up to 100 times
faster than Hadoop MapReduce in memory and up to 10 times faster on disk. Whereas Hadoop
MapReduce is natively programmed in Java (other languages go through Hadoop streaming,
as in the R example earlier), in Spark you can write applications using Java, Scala, Python,
and R. If this was not
sufficient, Spark also offers SQL, streaming, machine learning, and graph libraries that
can be combined in any fashion to create an application pipeline. Apart from accessing
data from HDFS, in Spark, you can connect to HBase, Cassandra, S3, and many more.

In this chapter, we use SparkR, which is a lightweight front end for using
Apache Spark from R. It's light but very rich in functionality. In a nutshell, SparkR
provides the following functionality:


• You can create SparkDataFrames from local R data frames or
Hive tables.

• SparkDataFrames support operations like selecting, grouping, and
aggregation, similar to those offered by the dplyr package in R.

• You can run SQL queries directly on Hive from R.

• It provides a set of machine learning algorithms from the
MLlib library of Spark.


This powerful offering is definitely taking the industry by storm. However, we will
keep our focus on the machine learning library of Spark, MLlib.

For interested readers, more details on Spark can be found at
http://spark.apache.org/docs/latest/index.html.


9.3 Machine Learning in R with Spark 


MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine 
learning scalable and easy. At a high level, it provides tools such as: 


• ML algorithms: Common learning algorithms such as
classification, regression, clustering, and collaborative filtering

• Featurization: Feature extraction, transformation, dimensionality
reduction, and selection

• Pipelines: Tools for constructing, evaluating, and tuning ML pipelines

• Persistence: Saving and loading algorithms, models, and pipelines

• Utilities: Linear algebra, statistics, data handling, etc.

Currently, SparkR supports the following machine learning algorithms:

• Generalized Linear Model

• Accelerated Failure Time (AFT) Survival Regression Model

• Naive Bayes Model

• KMeans Model


Under the hood, SparkR uses MLlib to train the model. The following code in R is
taken from our earlier example of housing price predictions, but this is a scalable version
of the model using SparkR.

Note (for Windows users): Before running the code, follow these steps:


1. Download a Spark release pre-built for Hadoop 2.7 and later
from http://spark.apache.org/downloads.html.


2. Extract the files into the C:-2.0.0-bin-hadoop2.7 folder (you 
can choose your own location). 


3. Create a symbolic link for the SparkR library using the 
following command in the cmd prompt: mklink /D 
"C:Files-3.2.2" "C:-2.0.0-bin-hadoop2.7". 


4. Using RStudio or the R command line, test using library(SparkR).

Let's go into the R code that follows and understand how SparkR helps build a
scalable machine learning model with a Spark engine. Keep in mind that the code is
executed in a standalone Spark cluster with only one node. The true potential of Spark
can only be seen when the same code runs on a large enterprise cluster of computing nodes
with Spark.


9.3.1 Setting the Environment Variable 


The following commands let R know the locations where the Spark and Hadoop binaries
are installed on your machine. Remember, both of these are the same environment
variables that you would have set in your system properties (for Windows machines).


#Set environment variable

Sys.setenv(SPARK_HOME = 'C:/Spark/spark-2.0.0-bin-hadoop2.7',
           HADOOP_HOME = 'C:/Hadoop-2.3.0')

.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS' = '"sparkr-shell"')
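
Before loading SparkR, it can help to confirm that the library directory derived from SPARK_HOME actually exists. The following is a small sketch under stated assumptions: sparkr_lib_path is a hypothetical helper introduced here, and the fallback path is simply the one used in this chapter, so adjust it to your own installation.

```r
# Hypothetical helper: build the SparkR library path from a Spark home directory.
sparkr_lib_path <- function(spark_home) {
  file.path(spark_home, "R", "lib")
}

# Use SPARK_HOME if it is set; otherwise fall back to the path used in this chapter.
spark_home <- Sys.getenv("SPARK_HOME", unset = "C:/Spark/spark-2.0.0-bin-hadoop2.7")
lib_path   <- sparkr_lib_path(spark_home)

if (dir.exists(lib_path)) {
  message("SparkR library directory found: ", lib_path)
} else {
  message("SparkR library directory not found: ", lib_path,
          " (check SPARK_HOME before calling library(SparkR))")
}
```

If the directory is reported as missing, library(SparkR) is likely to fail, so it is worth fixing the environment variable first.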


9.3.2 Initializing the Spark Session 


Once the environment variables are set, initialize the SparkR session with parameters like
spark.driver.memory, spark.sql.warehouse.dir, and so on, as shown in the following
code snippet. This initialization is required in order for the R environment to connect
with Spark running on the local machine.


library(SparkR)
library(rJava)


#The entry point into SparkR is the SparkSession which connects your R
#program to a Spark cluster

sparkR.session(enableHiveSupport = FALSE, appName = "SparkR-ML",
               master = "local[*]",
               sparkConfig = list(spark.driver.memory = "1g",
                                  spark.sql.warehouse.dir = "C:/Hadoop-2.3.0"))

Launching java with spark-submit command C:/Spark/spark-2.0.0-bin-hadoop2.7/bin/spark-submit2.cmd --driver-memory "1g" "sparkr-shell" C:\Users\Karthik\AppData\Local\Temp\Rtmpuogh3M\backend_port1030727b704d

Java ref type org.apache.spark.sql.SparkSession id 1


9.3.3 Loading Data and Running the Pre-Process


Load the housing data introduced in Chapter 6 and perform the same set of 
preprocessing steps as shown in the following code snippet: 


library(data.table)


#Read the housing data
Data_House_Price <- fread("/Users/karthik/Dropbox/Book Writing - Drafts/Chapter Drafts/Chapter 7 - Machine Learning Model Evaluation/tosend/House Sale Price Dataset.csv",
                          header = T, verbose = FALSE, showProgress = FALSE)


str(Data_House_Price)


Classes 'data.table' and 'data.frame': 1300 obs. of 14 variables:
 $ HOUSE_ID        : chr "0001" "0002" "0003" "0004" ...
 $ HousePrice      : int 163000 102000 265979 181900 252000 180000 115000 176000 192000 132500 ...
 $ StoreArea       : int 433 396 864 572 1043 440 336 486 430 264 ...
 $ BasementArea    : int 662 836 0 594 0 570 0 552 24 588 ...
 $ LawnArea        : int 9120 8877 11700 14585 10574 10335 21750 9900 3182 7758 ...
 $ StreetHouseFront: int 76 67 65 NA 85 78 100 NA 43 NA ...
 $ Location        : chr "RK Puram" "Jama Masjid" "Burari" "RK Puram" ...
 $ ConnectivityType: chr "Byway" "Byway" "Byway" "Byway" ...
 $ BuildingType    : chr "IndividualHouse" "IndividualHouse" "IndividualHouse" "IndividualHouse" ...
 $ ConstructionYear: int 1958 1951 1880 1960 2005 1968 1960 1968 2004 1962 ...
 $ EstateType      : chr "Other" "Other" "Other" "Other" ...
 $ SellingYear     : int 2008 2006 2009 2007 2009 2006 2009 2008 2010 2007 ...
 $ Rating          : int 6 4 7 6 8 5 5 7 8 5 ...
 $ SaleType        : chr "NewHouse" "NewHouse" "NewHouse" "NewHouse" ...
 - attr(*, ".internal.selfref")=<externalptr>

#Pulling out relevant columns and assigning required fields in the dataset
Data_House_Price <- Data_House_Price[, .(HOUSE_ID, HousePrice, StoreArea, StreetHouseFront,
                                         BasementArea, LawnArea, Rating, SaleType)]


#Omit any missing value
Data_House_Price <- na.omit(Data_House_Price)


Data_House_Price$HOUSE_ID <- as.character(Data_House_Price$HOUSE_ID)
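
The effect of the column selection and na.omit steps is easy to see on a toy data frame. This is a minimal sketch in base R rather than data.table: the object names toy, kept, and clean are hypothetical, the first three columns echo the first few values in the str() output, and the Location values are placeholders.

```r
# Toy stand-in for the housing data; Location values are placeholders.
toy <- data.frame(
  HOUSE_ID         = c("0001", "0002", "0003", "0004"),
  HousePrice       = c(163000, 102000, 265979, 181900),
  StreetHouseFront = c(76, 67, 65, NA),
  Location         = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Keep only the columns needed for modeling.
kept <- toy[, c("HOUSE_ID", "HousePrice", "StreetHouseFront")]

# Drop every row that has a missing value in any remaining column.
clean        <- na.omit(kept)
rows_dropped <- nrow(kept) - nrow(clean)

print(clean)
print(rows_dropped)
```

Because na.omit removes whole rows, selecting the relevant columns first avoids discarding rows for missing values in columns the model never uses.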


9.3.4 Creating SparkDataFrame 


Now, create the training and testing SparkDataFrames by splitting the original dataset,
Data_House_Price, into the first two-thirds for training and the remaining third for
testing. A SparkDataFrame is similar to a data frame in R in that it stores tabular data
in rows and columns, but Spark's implementation is designed to handle network transfers
efficiently and to spread processing across thousands of computing nodes.


#Spark Data Frame - Train
gaussianDF_train <- createDataFrame(Data_House_Price[1:floor(nrow(Data_House_Price)*(2/3)),])


#Spark Data Frame - Test
gaussianDF_test <- createDataFrame(Data_House_Price[floor(nrow(Data_House_Price)*(2/3) + 1):nrow(Data_House_Price),])

class(gaussianDF_train)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
class(gaussianDF_test)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
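
The index arithmetic behind this split can be checked in isolation. The following is a small base R sketch; split_rows is a hypothetical helper introduced here, and the row count of 10 is only an example.

```r
# Hypothetical helper: row indices for a sequential two-thirds / one-third split.
split_rows <- function(n) {
  cut <- floor(n * (2 / 3))          # last row of the training block
  list(train = 1:cut, test = (cut + 1):n)
}

idx <- split_rows(10)
print(idx$train)
print(idx$test)
```

Note that this is a sequential split, not a random one. If the rows of the dataset are ordered in some meaningful way, a random sample of row indices may give a more representative training set.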


9.3.5 Building the ML Model 


Essentially, this is the core of this chapter: the first machine learning model built to
scale to large datasets. spark.glm is a SparkR function that exposes the scalable
implementation of the Generalized Linear Model (GLM) in Spark's MLlib library. Ideally,
nothing changes as far as the syntax goes (except for the function name), but under the
hood, there could be a large army of nodes working together, automatically running
distributed map and reduce style tasks and many other operations supported by Spark to
achieve the final outcome.


# Fit a generalized linear model of family "gaussian" with spark.glm
gaussianGLM <- spark.glm(gaussianDF_train,
                         HousePrice ~ StoreArea + StreetHouseFront + BasementArea +
                           LawnArea + Rating + SaleType,
                         family = "gaussian")


# Model summary
summary(gaussianGLM)


Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
    Min       1Q   Median       3Q      Max
-432276   -23923    -4236    16522   380300


Coefficients:
                        Estimate  Std. Error    t value    Pr(>|t|)
(Intercept)               -80034       32619    -2.4536    0.014387
StoreArea                 58.172      9.8507     5.9054  5.4833e-09
StreetHouseFront          136.98      80.828     1.6947    0.090578
BasementArea              23.623      3.7224     6.3461  3.9629e-10
LawnArea                 0.77459     0.19875     3.8973   0.0001066
Rating                     35402      1519.4       23.3           0
SaleType_NewHouse         -12979       31904   -0.40681     0.68427
SaleType_FirstResale       10117       32497    0.31132     0.75565
SaleType_SecondResale     -24563       32480   -0.75626     0.44975
SaleType_ThirdResale      -22562       34847   -0.64748     0.51754
SaleType_FourthResale     -32205       36778   -0.87567     0.38151


(Dispersion parameter for gaussian family taken to be 2012650630)

    Null deviance: 4.9599e+12 on 711 degrees of freedom
Residual deviance: 1.4109e+12 on 701 degrees of freedom

AIC: 17286

Number of Fisher Scoring iterations: 1
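
Several numbers in this summary are tied together by simple arithmetic. The sketch below recomputes them in base R from the printed (rounded) values, and then confirms the same identity on R's built-in mtcars data with the ordinary glm function; it does not use Spark, and the variable names are hypothetical.

```r
# Values copied from the summary output (rounded as printed).
residual_deviance <- 1.4109e+12
residual_df       <- 701
null_df           <- 711

# The null model uses one parameter (the intercept), so n = null df + 1.
n_train <- null_df + 1

# Each non-intercept coefficient uses up one degree of freedom.
n_slopes <- null_df - residual_df

# For the gaussian family, dispersion is residual deviance / residual df.
dispersion <- residual_deviance / residual_df

# The same identity on a small built-in dataset with base R's glm().
fit  <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())
same <- isTRUE(all.equal(summary(fit)$dispersion,
                         deviance(fit) / df.residual(fit)))

print(c(n_train = n_train, n_slopes = n_slopes, dispersion = dispersion))
print(same)
```

The recomputed dispersion agrees with the reported 2012650630 only up to the rounding of the printed residual deviance. The ten degrees of freedom used match the ten non-intercept rows in the coefficient table.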


9.3.6 Predicting the Test Data

In the final step, you can now predict the house prices on the test dataset using the ML
model built in the previous step. Refer to Chapter 6 to understand the evaluation criteria
for this model.


#Prediction on the gaussianModel
gaussianPredictions <- predict(gaussianGLM, gaussianDF_test)


names(gaussianPredictions) <- c('HOUSE_ID', 'HousePrice', 'StoreArea', 'StreetHouseFront',
                                'BasementArea', 'LawnArea', 'Rating', 'SaleType',
                                'ActualPrice', 'PredictedPrice')
gaussianPredictions$PredictedPrice <- round(gaussianPredictions$PredictedPrice, 2.0)


showDF(gaussianPredictions[, 9:10])
+-----------+--------------+
|ActualPrice|PredictedPrice|
+-----------+--------------+
|   139400.0|      128582.0|
|   157000.0|      202101.0|
|   178000.0|      164765.0|
|   120000.0|       50425.0|
|   130000.0|      155841.0|
|   582933.0|      333450.0|
|   309000.0|      255584.0|
|   176000.0|      192695.0|
|   125000.0|      132784.0|
|   130000.0|      140085.0|
|   169990.0|      183082.0|
|   213000.0|      222965.0|
|   144000.0|      122123.0|
|   118500.0|      158940.0|
|   138000.0|      116004.0|
|   437154.0|      346572.0|
|   230000.0|      261396.0|
|    82000.0|       61949.0|
|    85000.0|      119914.0|
|   214900.0|      218930.0|
+-----------+--------------+
only showing top 20 rows
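
To get a feel for the size of the errors, the evaluation measures from Chapter 6 can be computed on the 20 rows displayed above. This is only an illustrative sketch in base R on the displayed rows, not an evaluation of the full test set; the vector names are hypothetical.

```r
# The 20 (actual, predicted) pairs shown in the output above.
actual <- c(139400, 157000, 178000, 120000, 130000, 582933, 309000, 176000,
            125000, 130000, 169990, 213000, 144000, 118500, 138000, 437154,
            230000,  82000,  85000, 214900)
predicted <- c(128582, 202101, 164765,  50425, 155841, 333450, 255584, 192695,
               132784, 140085, 183082, 222965, 122123, 158940, 116004, 346572,
               261396,  61949, 119914, 218930)

err  <- actual - predicted
mae  <- mean(abs(err))                   # mean absolute error
rmse <- sqrt(mean(err^2))                # root mean squared error
mape <- mean(abs(err) / actual) * 100    # mean absolute percentage error

# Row with the largest absolute error.
worst <- which.max(abs(err))

print(round(c(MAE = mae, RMSE = rmse, MAPE = mape), 2))
print(worst)
```

The worst row is the most expensive house in the sample, which the model underpredicts by a wide margin. Large squared errors like this one pull the RMSE above the MAE.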
9.3.7 Stopping the SparkR Session


In the end, when the job is done, execute the following code to free all the resources being 
held for this process, like CPU and memory. 


sparkR.stop()


While this code is running, you can open http://localhost:4040/jobs/ in
your browser and see the progress of your Spark jobs. For every job that is generated
automatically upon the execution of this code, you can look at the DAG visualization
and see how the Spark engine actually carries out the job.

To understand how these visualizations are built and what they tell you about what
your application is actually doing on the Spark cluster, see this blog post from Databricks:

https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html


9.4 Machine Learning in R with H2O


As we end this journey of machine learning in this book, we want to introduce one
more powerful platform for R users, called H2O. We have been discussing some powerful
techniques in machine learning like deep learning, text analysis, ensembles, and so on.
On large datasets, these techniques are often not feasible on individual machines and
need high-performance computing.

R is a popular language that is remarkably adaptable to different platforms, and it
provides options for integrating with powerful high-performance computing
environments. In previous chapters and sections, we showed some examples, like
Microsoft Cognitive Services, Spark, and other Apache products. In this last section, we
introduce H2O, which is an open source, high-performance cluster platform for big data
analysis.

H2O is developed and maintained by H2O.ai, formerly 0xdata, a startup founded
in 2011. H2O is marketed as "The Open Source In-Memory, Prediction Engine for Big Data
Science." It offers an impressive array of machine learning algorithms. The H2O R package
provides functions for building GLM, GBM, K-means, Naive Bayes, Principal Components
Analysis, random forests, and deep learning (multi-layer neural net models).

H2O runs on a Java Virtual Machine and is optimized for doing "in-memory" processing
of distributed, parallel machine learning algorithms on clusters. A "cluster" is a software
construct that can be fired up on your laptop, on a server, or across the multiple nodes of
a cluster of real machines, including computers that form a Hadoop cluster. According
to the latest documentation, the H2O software can be run on conventional operating
systems like Microsoft Windows (7 or later), Mac OS X (10.9 or later), and Linux (Ubuntu
12.04; RHEL/CentOS 6 or later). It also runs on big data systems, particularly Apache
Hadoop Distributed File System (HDFS), and is available on several popular Hadoop
distributions like Cloudera (5.1 or later), MapR (3.0 or later), and Hortonworks (HDP 2.1
or later). It also operates on cloud computing environments, for example using Amazon
EC2, Google Compute Engine, and Microsoft Azure. The H2O Sparkling Water software is
Databricks-certified on Apache Spark.

For R, the H2O package is available on CRAN. Before you proceed to the demo of H2O,
we recommend you follow these URLs, which have some well-documented materials:


• Complete documentation on the H2O package:
https://cran.r-project.org/web/packages/h2o/h2o.pdf

• Further documentation on H2O is available at h2o.ai as well:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html

• More implementations of ML algorithms for H2O:
https://github.com/h2oai/h2o-3/tree/master/h2o-r/demos

• Installation of h2o: A user-friendly and easy-to-follow description
of installation is provided here:
http://h2o-release.s3.amazonaws.com/h2o/master/1735/docs-website/Ruser/Rinstall.html

• A presentation on High Performance Machine Learning in R
with H2O: http://www.stat.berkeley.edu/~ledell/docs/h2o_hpccon_oct2015.pdf


9.4.1 Installation of Packages 


Once you are done installing the prerequisites, the following code will fetch the
latest release of the H2O package for R and install it on the local system.
Notes:


• A good Internet connection is recommended before you try this
code. All computations are performed (in highly optimized Java
code) in the H2O cluster and initiated by REST calls from R.


• It's advisable not to experiment with this code on your local
machine with large volumes of data (it's safe to run the demos
shown in the following code on your local machine).


# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }


# Next, we download, install and initialize the H2O package for R.
install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R",
                                 getOption("repos"))))


#Alternatively you can install the package h2o from CRAN as below
install.packages("h2o")


9.4.2 Initialization of H2O Clusters 


Once the installation is done, you can start a cluster instance for the computation by
calling the h2o.init() function.

# Load the h2o library in R
library(h2o)
#Initiate a cluster in your machine
localH2O = h2o.init()

The above function will return an output saying Connection successful, as
shown below:

Starting H2O JVM and connecting: .... Connection successful!

R is connected to the H2O cluster:
    H2O cluster uptime:         4 seconds 188 milliseconds
    H2O cluster version:        3.10.0.6
    H2O cluster version age:    1 month and 9 days
    H2O cluster name:           H2O_started_from_R_abhisheksingh_zve484
    H2O cluster total nodes:    1
    H2O cluster total memory:   0.89 GB
    H2O cluster total cores:    4
    H2O cluster allowed cores:  2
    H2O cluster healthy:        TRUE
    H2O Connection ip:          localhost
    H2O Connection port:        54321
    H2O Connection proxy:       NA
    R Version:                  R version 3.2.3 (2015-12-10)

Note: As started, H2O is limited to the CRAN default of 2 CPUs.
Shut down and restart H2O as shown below to use all your CPUs.
> h2o.shutdown()
> h2o.init(nthreads = -1)


Once you have initiated a cluster on your local machine, you are ready to run your
computations on high-power clusters of H2O. There are a lot of other examples to get you
started with Gradient Boosting Machine (GBM), Generalized Linear Models (GLM),
ensemble trees, and many more.


9.4.3 Deep Learning Demo in R with H2O


The following code runs a built-in demo of deep learning using the demo function with
the parameter h2o.deeplearning, which internally makes the REST API calls to the local
H2O cluster. In brief, the code:


1. Imports a built-in dataset named prostate.csv, parses it,
and prints a summary. The data was collected by Dr. Donn
Young at the Ohio State University Comprehensive Cancer
Center for a study of patients with varying degrees of prostate
cancer. The goal of this demo is to predict whether a tumor
has penetrated the prostate capsule based on the variables
measured at a baseline exam. The metadata is shown in
Figure 9-20.

Variable  Description                                        Code/Values                   Variable Name
1         Identification Code                                1-380                         ID
2         Tumour Penetration of Prostatic Capsule            0 = No Penetration            CAPSULE
                                                             1 = Penetration
3         Age                                                Years                         AGE
4         Race                                               1 = White                     RACE
                                                             2 = Black
5         Results of the Digital Rectal Exam                 1 = No Nodule                 DPROS
                                                             2 = Unilobar Nodule (Left)
                                                             3 = Unilobar Nodule (Right)
                                                             4 = Bilobar Nodule
6         Detection of Capsular Involvement in Rectal Exam   1 = No                        DCAPS
                                                             2 = Yes
7         Prostatic Specific Antigen Value                   mg/ml                         PSA
8         Tumour Volume Obtained from Ultrasound             cm cube                       VOL
9         Total Gleason Score                                0-10                          GLEASON


Figure 9-20. Feature definition of prostate cancer dataset

2. Then, it runs deep learning on the dataset to predict the
tumor penetration of the prostatic capsule.


This demo runs H2O on localhost:54321.


9.4.3.1 Running the Demo 


The demo function runs everything at once and prints the entire output in one go. However,
for a better understanding of what the function does, we have split the output and explain
each part in detail.


# Run Deep learning demo
demo(h2o.deeplearning)


The demo runs. 


9.4.3.2 Loading the Testing Data 


Load the data from the local file system directory of R, where the H2O package is
installed. It might look like C:\Users\Karthik\Documents\R\win-library\3.2\h2o\extdata.


> prostate.hex = h2o.uploadFile(path = system.file("extdata", "prostate.csv", package="h2o"),
                                destination_frame = "prostate.hex")

Summary output. The summary output should match the feature definition as per
Figure 9-20.


> summary(prostate.hex)

 ID               CAPSULE          AGE             RACE
 Min.   :  1.00   Min.   :0.0000   Min.   :43.00   Min.   :0.000
 1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   1st Qu.:1.000
 Median :190.50   Median :0.0000   Median :67.00   Median :1.000
 Mean   :190.50   Mean   :0.4026   Mean   :66.04   Mean   :1.087
 3rd Qu.:285.25   3rd Qu.:1.0000   3rd Qu.:71.00   3rd Qu.:1.000
 Max.   :380.00   Max.   :1.0000   Max.   :79.00   Max.   :2.000
 DPROS            DCAPS            PSA               VOL
 Min.   :1.000    Min.   :1.000    Min.   :  0.300   Min.   : 0.00
 1st Qu.:1.000    1st Qu.:1.000    1st Qu.:  4.900   1st Qu.: 0.00
 Median :2.000    Median :1.000    Median :  8.664   Median :14.20
 Mean   :2.271    Mean   :1.108    Mean   : 15.409   Mean   :15.81
 3rd Qu.:3.000    3rd Qu.:1.000    3rd Qu.: 17.063   3rd Qu.:26.40
 Max.   :4.000    Max.   :2.000    Max.   :139.700   Max.   :97.60
 GLEASON
 Min.   :0.000
 1st Qu.:6.000
 Median :6.000
 Mean   :6.384
 3rd Qu.:7.000
 Max.   :9.000


Model building. The function h2o.deeplearning builds a deep learning model
using CAPSULE as the response variable and the remaining variables (except ID) as
predictors. Additional parameters are:


• hidden, which specifies the hidden layer sizes

• activation, which specifies the type of activation function; the
demo uses a Tanh function

• epochs, which tells the neural network how many times
the dataset should be iterated (streamed)


> # Set the CAPSULE column to be a factor column then build model.
> prostate.hex$CAPSULE = as.factor(prostate.hex$CAPSULE)


> model = h2o.deeplearning(x = setdiff(colnames(prostate.hex), c("ID","CAPSULE")),
                           y = "CAPSULE", training_frame = prostate.hex,
                           activation = "Tanh", hidden = c(10, 10, 10), epochs = 10000)

Print the output. The output of the five-layer deep neural network is printed by
accessing the model summary field from the model object created in the previous step.


> print(model@model$model_summary)

Status of Neuron Layers: predicting CAPSULE, 2-class classification,
bernoulli distribution, CrossEntropy loss, 322 weights/biases, 8.4 KB,
3,800,000 training samples, mini-batch size 1


  layer units    type dropout       l1       l2 mean_rate rate_rms
      1     7   Input  0.00 %
      2    10    Tanh  0.00 % 0.000000 0.000000  0.004538 0.009754
      3    10    Tanh  0.00 % 0.000000 0.000000  0.007007 0.011632
      4    10    Tanh  0.00 % 0.000000 0.000000  0.003262 0.005256
      5     2 Softmax         0.000000 0.000000  0.002906 0.000392


  layer momentum mean_weight weight_rms mean_bias bias_rms
      1
      2 0.000000   -0.118311   1.642809 -0.152061 1.519672
      3 0.000000    0.018304   1.594797 -0.470666 0.681625
      4 0.000000   -0.063209   1.924838 -0.545838 0.903191
      5 0.000000    0.495293   4.894484  0.012870 2.835105
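
The counts reported in the status line can be reproduced from the layer sizes. The following is a short base R sketch of that arithmetic, assuming a fully connected network with one bias per non-input unit; the variable names are hypothetical.

```r
# Units per layer: 7 inputs, three hidden layers of 10, and 2 output classes.
layer_units <- c(7, 10, 10, 10, 2)

fan_in  <- head(layer_units, -1)   # units feeding each non-input layer
fan_out <- tail(layer_units, -1)   # units in each non-input layer

n_weights <- sum(fan_in * fan_out) # one weight per connection
n_biases  <- sum(fan_out)          # one bias per non-input unit
n_params  <- n_weights + n_biases

# Training samples processed: rows in the dataset times the number of epochs.
training_samples <- 380 * 10000

print(c(weights = n_weights, biases = n_biases, total = n_params))
print(training_samples)
```

The total of 322 matches the weights/biases figure in the status line, and 380 rows streamed for 10000 epochs gives the 3,800,000 training samples. The 7 input units correspond to the nine columns of the dataset minus ID and CAPSULE.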


Make the prediction. Since the dataset is small, we haven't split the data into
training and testing datasets, but rather show the predictions on the same dataset used in
training. However, in cases where sufficient data is available, you are encouraged to run
the prediction on a testing dataset to better understand the efficacy of the model.


> # Make predictions with the trained model with training data.
> predictions = predict(object = model, newdata = prostate.hex)


> # Export predictions from H2O Cluster as R dataframe.
> predictions.R = as.data.frame(predictions)


> head(predictions.R)
  predict           p0           p1
1       0 9.984036e-01 1.596373e-03
2       0 9.999973e-01 2.683004e-06
3       0 9.731078e-01 2.689217e-02
4       0 9.496504e-01 5.034956e-02
5       0 9.996701e-01 3.298716e-04
6       1 4.167409e-07 9.999996e-01
> tail(predictions.R)
    predict          p0           p1
375       0 0.999999999 7.078566e-10
376       0 0.986077940 1.392206e-02
377       0 0.998982044 1.017956e-03
378       1 0.008513801 9.914862e-01
379       0 1.000000000 5.989944e-11
380       0 1.000000000 2.681686e-14
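
The predict column can be related to the class probabilities with a simple threshold rule. The sketch below is plain base R on the twelve rows shown above. It assumes that the label is obtained by comparing p1 with the max-F1 threshold of 0.347034 reported in the performance output later in this section; that assumption is ours and is only checked against these displayed rows.

```r
# p1 and predict values from the head() and tail() output above.
p1 <- c(1.596373e-03, 2.683004e-06, 2.689217e-02, 5.034956e-02,
        3.298716e-04, 9.999996e-01, 7.078566e-10, 1.392206e-02,
        1.017956e-03, 9.914862e-01, 5.989944e-11, 2.681686e-14)
shown_label <- c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Assumed decision rule: predict class 1 when p1 reaches the threshold.
threshold <- 0.347034
label     <- as.integer(p1 >= threshold)

# The two class probabilities are complementary.
p0 <- 1 - p1

print(label)
print(all(label == shown_label))
```

On these rows the rule reproduces the displayed labels, but every p1 value here is far from the threshold, so the rows cannot distinguish this rule from, say, a cutoff at 0.5.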


Model evaluation. The AUC of the model is 0.996 and its accuracy is 98.2% (7
misclassifications out of 380 at the F1-optimal threshold), which is exceptionally good,
although these metrics are reported on the training data. The other measures in the
output, for example Mean Squared Error (MSE) and the Gini index, were discussed in
detail in Chapter 6.


> # Check performance of classification model.
> performance = h2o.performance(model = model)


> print(performance)

H2OBinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **

MSE:  0.01764182
RMSE:  0.1328225
LogLoss:  0.0741766
Mean Per-Class Error:  0.01861449
AUC:  0.9958826
Gini:  0.9917653


Confusion Matrix for F1-optimal threshold:
         0   1    Error    Rate
0      223   4 0.017621  =4/227
1        3 150 0.019608  =3/153
Totals 226 154 0.018421  =7/380


Maximum Metrics: Maximum metrics at their respective thresholds
                         metric threshold    value idx
1                        max f1  0.347034 0.977199 114
2                        max f2  0.347034 0.979112 114
3                  max f0point5  0.730649 0.983718 106
4                  max accuracy  0.551164 0.981579 110
5                 max precision  1.000000 1.000000   0
6                    max recall  0.007983 1.000000 152
7               max specificity  1.000000 1.000000   0
8              max absolute_mcc  0.347034 0.961761 114
9    max min_per_class_accuracy  0.347034 0.980392 114
10  max mean_per_class_accuracy  0.347034 0.981386 114
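
Most of the numbers in this report follow directly from the confusion matrix. The sketch below recomputes them in base R using the standard definitions from Chapter 6; it does not call H2O, and the variable names are hypothetical.

```r
# Confusion matrix at the F1-optimal threshold (rows = actual, columns = predicted).
cm <- matrix(c(223,   4,
                 3, 150),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Per-class error rates, as in the Error column of the matrix.
class_error          <- c("0" = fp / (tn + fp), "1" = fn / (fn + tp))
mean_per_class_error <- mean(class_error)

# Relations between the scalar metrics in the printout.
rmse <- sqrt(0.01764182)     # RMSE is the square root of MSE
gini <- 2 * 0.9958826 - 1    # Gini = 2 * AUC - 1

print(round(c(accuracy = accuracy, f1 = f1,
              mean_per_class_error = mean_per_class_error,
              rmse = rmse, gini = gini), 6))
```

The recomputed F1 and mean per-class error agree with the max f1 row and the Mean Per-Class Error line of the report. Keep in mind that all of these figures describe the training data.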


More demos in the H2O package. Running the following command will list all the
available demos in H2O, which you can run one at a time to observe how the model
building process is followed for each specific ML algorithm.


demo(package = "h2o")
Demos in package 'h2o':

h2o.anomaly         H2O anomaly using prostate cancer data
h2o.deeplearning    H2O deeplearning using prostate cancer data
h2o.gbm             H2O generalized boosting machines using prostate cancer data
h2o.glm             H2O GLM using prostate cancer data
h2o.glrm            H2O GLRM using walking gait data
h2o.kmeans          H2O K-means using prostate cancer data
h2o.naiveBayes      H2O naive Bayes using iris and Congressional voting data
h2o.prcomp          H2O PCA using Australia coast data
h2o.randomForest    H2O random forest classification using iris data


9.5 Summary 


In the days to come, as the cost of infrastructure goes down and data volume increases,
the need for scaling up will become the first priority in the machine learning process
flow. Every application built on machine learning will have to start by thinking about
a scalable implementation. Most traditional RDBMS systems will soon become
obsolete as data explodes in size. The giants in the industry have already
started to take the first step toward migrating to systems that support large scale and
the agility to change as per business needs. In the not-so-distant future, a greater
emphasis on efficient algorithmic design and a focus on subjects like quantum computing
will start to appear, as the challenge of growing data volume is addressed by another
wave of disruptive technology.

We have taken a comprehensive journey into the world of machine learning,
drawing inspiration from fast-growing data science methodology and techniques.
Though the vast majority of the ML model building process flow already exists and is
explained with much elegance in the classic literature, we felt a need to stitch the ML
model building process flow together with the modern thinking emerging from data science.

We have also simplified the statistics and mathematics wherever possible to make
the study of ML more practical, and we have given plenty of additional resources for
further reading. Topics like sampling, regression models, and deep learning are so
deep and diverse that each of them could fill a book of equal size. However, the
practical applicability of such algorithms was made possible by the plethora of R
packages available on CRAN.

Since R is the preferred programming language for beginners as well as advanced
users for building quick ML prototypes around a real-world problem, we chose R to
demonstrate all the examples in the book. If you want to pursue machine learning for
your career or research work, a fine balance of skillsets in computer science, statistics,
and domain knowledge will prove to be useful.
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Information gain, 302 
Initial data analysis (IDA), 31 
description, 41 
dplyr, 43-46 
multiple sources, 43 
naming convention, 42 
str() function, 41 
table(): pattern, 43 
Item-Based Collaborative 
Filtering (IBCF) 
cosine/Pearson correlation, 366 
creation rating matrix, 368 
data preparation, 367 
distribution of ratings, 368 
evaluation, 370 
exploring, rating matrix, 368 
loading data, 367 
raw ratings by users, 369 
true positive ratio vs. 
false positive ratio, 371 
UBCF recommendation 
model, 370 
Iteration error, 375 
Iterative Dichotomizer 3 (ID3) 
algorithm, 304 
commands, 304 
model building, 305 
model evaluation, 305 
RWeka, 304 
RWekajars, 304 


J 


Jaccard index, 354 
JSON file, 38-40 


K 


Kappa error metric, 459-462 

k-fold cross validation, 456-457 

K-Means Clustering Algorithm, 344 

Knowledge Discovery and Data Mining 
(KDD), 4 

Kolmogorov-Smirnov tests (KS test), 253, 
433 

Kurtosis, 59-61 


L 


Law of Large Numbers (LLN), 81 
strong law, 82 
weak law, 82 
Learning Vector Quantization (LVQ), 477, 
479 
Least Absolute Shrinkage and Selection 
Operator (LASSO), 206-207 
LGD. See Loss Given Default (LGD) 
Lift chart, 284 
Linear predictors 
bias of estimator, 239 
consistent estimator, 240 
efficient estimator, 240 
OLS, 239 
Linear regression, 118, 119, 126, 238, 437 
actual vs. predicted, 243, 247 
affine function, 238 
definition, 238 
dependent and independent 
variable, 241 
diagnostics, 242 
estimated equation, 241 
estimation, 242 
Gauss-Markov theorem, 239 
Im() package, 241 
minimization problem, 238 
model diagnostics 
homoscedasticity, 258-261 
influential point analysis, 248-251 
multicollinearity, 254-256 
normality of residuals, 252-254 
outliers, 248 
residual autocorrelation, 256-258 
OLS, 238 
parametric method, 239 
predicted values, 243 
residuals, 242 
standard error, 242 
t-value and p-value, 242-243 
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Line chart 
description, 132 
GDP growth, 
countries, 132, 134 
melt() function, 132-133 
Link function, 266 
List, 20 
Logistic regression 
analysis, 275 
binomial, 265, 269-275 
binomially distributed, 265 
logit transformation, 266-267 
model diagnostics 
bivariate plots, 279-281 
concordance and discordant 
ratios, 284-285 
cumulative gains and lift charts, 
281-284 
deviance, 276 
log likelihoods, 276 
pseudo R-Square, 277 
wald test, 275-276 
multinomial, 265, 285-286, 288-289 
odds ratio, 267-268 
ordered, 266 
predictor variables, 265 
Logit function, 266, 267 
Logit transformation, 266-267 
Loss Given Default (LGD), 185, 187 
LOWESS plot (Locally Weighted 
Scatterplot Smoothing), 237 


Machine learning (ML), 3 
abstraction layer, 219 
algorithms 
ANN, 227 
association rule mining, 226 
Bayesian algorithms, 226 
clustering algorithms, 226 
deep learning, 227 
dimensionality reduction, 228 
distance-based/event-based 
algorithms, 224 
ensemble learning, 228 
regression-based methods, 224 
regularization methods, 225 
text mining, 228 
tree-based algorithms, 225 
case study, 221 
computer vision, 219 
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3D approach 
demo in R, 220 
real-world use case, 220 
statistical background, 220 
distributions, 12 
evaluation, 27 
exploration, 26-27 
feature engineering (see Feature 
engineering) 
friction-less pipeline, 1 
intelligent personal assistant/ 
machines, 219 
PEBE framework, 221 
phase forms, 26 
plethora of algorithms, 1 
predictive models, 219 
process flow, 26 
probability, 12 
conditional independence, 10 
counting, 7-9 
independent events, 9-10 
notation, 6 
statistics, 7 
randomness, 12 
R-package, 221 
statistical concepts, 221 
statistical learning, 1, 220 
statistical modeling, 466-467 
statistics and computer science, 2 
types 
factors, 222 
reinforcement learning, 223 
semi-supervised learning, 223 
supervised learning, 222 
unsupervised learning, 223 
Manual grid search optimization, 477-478 
MapReduce, 520, 522, 523, 526-531, 534, 
535, 540 
Market Basket Data, 232 
Matrix, 20 
Maximum likelihood estimation (MLE), 
267 
Mean, 53, 70 
Mean absolute error, 439-440 
Mean Absolute Percentage Error (MAPE), 
439 
Mean Absolute Scaled Error (MASE), 439 
Microsoft Excel, 34 
Model building checklist, 422-423 
Model evaluation 
continuous output 


962 


mean absolute error, 439-440 
model performance metrics, 
437-438 
RMSE, 441 
R-square, 442-445 
discrete output 
classification matrix, 446-450 
ROC curve, 452-454 
sensitivity and specificity, 451-452 
kappa error metric, 459-462 
population stability index (see 
Population stability index) 
probabilistic techniques (see 
Probabilistic techniques) 
statistical methods, 431-432 
Model performance 
Bayesian optimization, 511-515 
bias and variance tradeoff (see Bias 
and variance tradeoff) 
Caret package, 468-470 
continuous output, 430 
discrete output, 431 
ensemble learning (see Ensemble 
learning) 
evaluation, 431-432 
hyper-parameters (see 
Hyper-parameters) 
machine learning and statistical 
modeling, 466-467 
testing data, 430 
training data, 430 
validation data, 430 
Model performance. See Model evaluation 
Model sampling, 68 
Model-selection process, 247 
Model suffering 
from bias, 492 
from variance, 492 
Moment, 55-56 
Monte Carlo method 
acceptance-rejection methods, 124 
beta density, 125 
EDF, 124-125 
random sampling techniques, 124 
stochastic calculus, 124 
Multicollinearity, 254-256 
Multi-Layer Perceptron (MLP), 379 
Multinomial logistic regression 
classifier, 286, 288 
class imbalance, 288 
estimation process, 286 


multinom() function, 286 
probability/proportion, 288 
Naive Bayes method 
Bayes theorem, 330-331 
chain rule, 332 
conditional probability, 330, 332 
data preparation, 332-333 


likelihood and marginal likelihood, 331 


model, 334 
model evaluation, 335-336 
posterior probability, 331 
prior probability, 331 
purchase prediction dataset, 330 
National Sample Survey Organization 
(NSSO), 70 
Natural Language Processing (NLP), 397, 
417 
Neuron anatomy, 373 
Nonparametric Multiplicative Regression 
(NPMR), 235 
Non-probability sampling, 80 
Not Available (NAs), 47-48 


O 


Online machine learning algorithms 
benefits and challenges, 418 
fuzzy C-means clustering, 419, 421 
tackling, 417 

Optimal search optimization, 481, 483 

Ordinary Least Square (OLS), 238-239 


P 


Particle swarm optimization, 382 
Part-of-speech (POS) 

categorization, 402 

extraction, 405 

frequency, 406 

mapping, 403 

pre-processing, 403-404 
Pearson Product-Moment Correlation 

Coefficient, 235 

Perceptron, 374-376 
Performance evaluation metrics, 270 
Permutation, 9 
Pie charts, 152-154 
Point-of-sale (POS), 354 
Polynomial regression, 261-265 
Pooled mean, 70 
Pooled variance, 70 
Population stability index 
continuous distribution, 432-433 
discrete cases, 436 
discrete distributions, 437 
ECDF plots, Set_1 and Set_2, 435 
Empirical Cumulative Distribution 
Function (ECDF), 433-434 
KS test, 433, 436 
threshold values, 436 
Principal component analysis (PCA), 228 
advantages, 215 
orthogonality, 215 
steps, 212 
Probabilistic techniques 
bootstrap sampling, 458-459 
K-fold cross validation, 456-457 
Probability 
vs. non-probability sampling, 80 
sampling technique, 79 
data dimensions, 90 
histogram, 91-93 
population mean, 90 
population variance, 91 
sampling methods, 89 
Probability of default (PD), 185 
Pseudo R-Square, 277 
Purposive sampling, 81 


Q 


Quantile, 52-53 
Quota sampling, 81 


R 


building blocks, 18 
calculations, 18 
data frames, 23 
data structures, 19 
functions, 23-25 
GNU S, 18 

lists, 22 

matrixes, 22 
packages, 19 
statistics, 19 
subsetting, 21 
vectors, 21 
Radial basis function (RBF), 291 
Rand index, 353 
Random Forest, 326-327, 329, 511 
Random search algorithms, 485 
Random search optimization, 483, 485 
rbinom(), 83 
R code, 83-86, 88-89 
Receiver operating characteristic (ROC) 
curve, 274, 452, 451, 455 
Recommendation algorithm, 364 
Recursive binary split, 299 
Recursive partitioning, 481 
Regression analysis 
causation, 234 
distributional assumptions, 233 
linear model, 234 
non-parametric methods, 235 
notation, 234 
parametric methods, 234 
prediction/forecasting, 234 
statistical learning and machine 
learning space, 233 
statistical model, 234 
variables, 234 
Regression-based methods, 224 
Regression trees, 299, 481 
Regularization algorithms, 225 
Reinforcement learning, 223 
Relational Database Management 
Systems (RDBMS), 40, 531 
Residual Sum of Squares (RSS), 299 
Residuals vs. fitted plot, 260 
River plots. See Sankey plots 
RMSE. See Root mean square error 
(RMSE) 
ROC curve. See Receiver operating 
characteristic (ROC) curve 
Root mean square error (RMSE), 441 
Root node, 298 
Sample point, 71 

Sampling 
bias, 72 
classification, 69 
description, 68 
distribution, 70 
error, 71 
fraction, 72 
objectives, 69 
population mean, 70 
population statistics, 68 
sources and storing, 67 
technological advancement, 67 
test statistics, 70 
variance, 70 
Sampling without replacement (SWOR), 
72 
Sampling with replacement (SWR), 72 
Sankey plots, 169-170 
Scatterplots 
description, 144 
higher dimensional, 144 
population vs. GDP relationship, 145 
Semi-supervised learning, 223 
Serial correlation, 256 
Shapiro-Wilk test, 88 
Sigmoid function, 377 
Sigmoid neurons, 377 
Silhouette coefficient, 352-353 
Simple random sampling 

distribution of data, 96 
function, 94 
histograms, 96 
hypothesis, 96 
KS test, 96, 98 
population, 93 
population average, 95 
population sampling, 94 
population size, 98 
p-value of t.test, 97 
replacement, 98-99 
sample and population, 97 
sample() function, 95 
summarise function, 95 
without replacement, 93 
Simulated annealing, 381 
Simulation, 83-86, 88-89 
Skewness, 57-58 
Spark’s machine learning 
algorithms, 541 
build, ML model, 544 
MLlib, 541 
preprocessing, 542 
SparkDataFrame creation, 543 
SparkR session, initializing, 542 
sparkR.stop(), 546 
system properties, setting, 542 
test dataset, 545 
tools, 541 


Spatial maps 
data frame creation, 176 
ggmap(), 174, 175 
ggplot() function, 174 
India map, robbery counts, 177 
Specialization vs. generalization, 379 
Squared Euclidean distance, 291 
Stacked column charts 
age dependency ratio, 139 
contribution, sectors, 139 
description, 138 
working age ratio, 141 
Stacking, 228, 329, 495, 498, 506-511, 516 
Statistical learning, 2-3, 220, 229, 233, 382, 
463, 467, 471, 516 
Stratified random sampling 
disadvantages, 105 
histograms, 109 
KS test, 109 
population, 110 
proportion, 108 
sample() function, 107 
stratified function, 107 
stratified sampling, 105, 107 
stratum variables, 106 
sub-populations, 104 
summarise() function, 109 
t.test(), 109 
Summary statistics, 52 
Supervised learning, 222-223, 228, 307, 
329, 354, 374, 380, 383, 467, 497 
Supervised vs. unsupervised learning, 380 
Support vector machine (SVM) 
binary classifier 
data preparation, 293 
data summary, 293 
model building, 294 
model evaluation, 294-295 
classification, 292 
class separation, 290-291 
hard margins, 292 
linear, 292 
multi-class, 295-297 
nonlinearity, 291 
overlapping classes, 291 
soft margins, 292 
Systematic random sampling 
business and computational capacity, 
104 
circular sampling frame, 100 
EDF, 103 
formula, 102 
homogeneous sets, 101 
KS test, 103 

population variance, 100 
sample distribution, 104 
sample frame, 102 

skip factor, 100, 102 
subsetting, 101 


T 


Term Frequency-Inverse Document Frequency 
(TF-IDF), 400 
Text mining algorithms, 229, 231 
Text-mining approaches 
consumer behavior/product 
performance, 396 
data preparation, 398 
data summary, 397 
Microsoft Cognitive Services 
analytics features, 408 
language detection, 414-416 
mscstexta4r, 409 
Project Oxford, 408 
sentiment analysis, 411-412 
summarization, 416 
third-party API, 407 
topic detection, 412, 414 
twitteR package, 408 
NLP, 396, 397 
POS tagging, 402-406 
summarization, 398-400 
text analysis, 397 
text data, 396 
TF-IDF, 400, 402 
Twitter statistics, 396 
word cloud, 397, 406-407 
Time series graphs, 170 
GDP growth, countries, 170 
GDP growth, recession, 171-172 
Torsten Hothorn, 229 
True Negative Rate (TNR), 451, 452 
True positive rate (TPR), 451, 452 
Twitter feeds and article, 231 


U 


UCI Machine Learning Repository, 
231, 295 

Unsupervised Fuzzy Competitive 
Learning, 418 
Unsupervised learning, 223, 227, 337, 379, 
380, 383, 389, 467, 493 

User-Based Collaborative Filtering 
(UBCF), 365-366 


V 


Variable subset selection 
definition, 195 
embedded method 
fit model, 207-208 
fitted Cross Validated Linear 
Model, 209 
glmnet fit model, 208 
logistic regression, 207 
misclassification error and log of 
penalization factor (lambda), 209 
regularization, 206 
statistical approaches, 206 
filter method 
CoV, 196-197 
Gini coefficient, 198 
statistical approaches, 195 
variance threshold, 195 
wrapper method, 199-205 
Variance, 56-57, 70 
Variance inflation factor 
(VIF), 255, 256 
Vectors, 20-22, 25, 33, 291, 341, 365 


W 


Wald test, 275-276 

Waterfall charts, 162-164 

Within cluster sum of squares (WCSS), 
344, 345 

Wordclouds, 167-168 

World development indicators (WDI), 
50, 132 


X, Y, Z 


XML. See Extensible Markup Language 
(XML) 


